> The problem with FeatureUnion is that it can only combine the output of two
> transformers. I think it would be great to have a simple method of combining
> the result of a transformer with external/untransformed data within a
> pipeline.

I think it would be nice if FeatureUnion made it easy to extract only
certain parts of the input for each transformer.
https://github.com/scikit-learn/scikit-learn/issues/2034 intends to cover
this issue, but we haven't settled on a clean API. Suggestions are welcome!

- Joel

On 28 February 2014 03:10, Lars Buitinck <larsm...@gmail.com> wrote:
> 2014-02-27 8:33 GMT+01:00 michael kneier <michael.kne...@gmail.com>:
>> I would like to add a "combiner" class which would work with Pipeline to
>> allow users to augment the output of scikit's text feature extraction
>> process (or other feature extraction processes). For example, after applying
>> CountVectorizer, it is sometimes desirable to augment the resulting dataset
>> with additional features. Unless I am missing something, this is not easily
>> done if the count vectorization is being used in a pipeline, especially if
>> CountVectorizer parameters such as min_df are being optimized along with
>> downstream model parameters.
>
> CountVectorizer is very customizable. You can give it a custom analyzer
> that extracts the features you want:
>
>     CountVectorizer(analyzer=features)
>
> where features is some custom function that gets either a filename or
> a file's content (as a string) and returns whatever features you want.
> The only downside is that all the features are going to be counted, so
> things like timestamps aren't going to be handled nicely.
>
> If that doesn't do the trick, have a look at DictVectorizer. That's
> even more flexible: you give it dicts mapping feature names to
> (numeric or string) values. It will build a matrix representation
> using booleans in place of string values, but it will leave the
> numeric values untouched.

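To make the suggestions in this thread concrete, here is a minimal sketch
of one way to combine them. It is a workaround under stated assumptions,
not an API that scikit-learn provides for this purpose: FieldSelector is a
hypothetical helper written for this example, the sample data and the
"text"/"meta" field names are made up, and features is just one possible
custom analyzer. CountVectorizer handles the text, DictVectorizer handles
the extra (numeric or string) features, and FeatureUnion concatenates the
two blocks.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper (not part of scikit-learn).

    Picks one field out of each sample dict so that each branch of the
    FeatureUnion only sees the part of the input it cares about.
    """

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X):
        return [sample[self.key] for sample in X]


def features(text):
    """Custom analyzer: lowercased whitespace tokens plus one extra flag."""
    tokens = text.lower().split()
    if any(ch.isdigit() for ch in text):
        tokens.append("HAS_DIGIT")
    return tokens


# Made-up samples: raw text plus extra features kept in a separate dict.
samples = [
    {"text": "Cheap flights from 99 euros", "meta": {"source": "ad", "hour": 9.5}},
    {"text": "Meeting moved to Friday", "meta": {"source": "mail", "hour": 17.0}},
    {"text": "cheap meeting rooms", "meta": {"source": "ad", "hour": 13.25}},
]

union = FeatureUnion([
    # Branch 1: counted text features from the custom analyzer.
    ("counts", Pipeline([
        ("select", FieldSelector("text")),
        ("vec", CountVectorizer(analyzer=features, min_df=1)),
    ])),
    # Branch 2: extra features; strings become indicator columns,
    # numeric values are passed through unchanged.
    ("extra", Pipeline([
        ("select", FieldSelector("meta")),
        ("vec", DictVectorizer()),
    ])),
])

# Sparse matrix: count columns first, then the DictVectorizer columns.
X = union.fit_transform(samples)
print(X.shape)
```

Because the vectorizer stays inside the union, its parameters remain
reachable through the usual double-underscore names (here
counts__vec__min_df), so min_df could still be tuned together with a
downstream model. Keeping the numeric "hour" value in the DictVectorizer
branch avoids the counting problem mentioned above for things like
timestamps.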
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general