> The problem with FeatureUnion is that it can only combine the output of two 
> transformers. I think it would be great to have a simple method of combining 
> the result of a transformer with external/untransformed data within a 
> pipeline.

I think it would be nice if FeatureUnion made it easy to extract
only certain parts of the input for each transformer.
https://github.com/scikit-learn/scikit-learn/issues/2034 is intended to
cover this issue, but we haven't settled on a clean API.
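
In the meantime, here is a rough sketch of one possible workaround, not a
proposed API: each branch of the FeatureUnion starts with a small selector
step that pulls out the part of the input it needs. The record layout and
the get_text/get_extra helpers are made up for illustration, and
FunctionTransformer is assumed to be available in the installed version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical records: free text plus one untransformed numeric feature.
records = [
    {"text": "the cat sat", "length": 3.0},
    {"text": "the dog barked loudly", "length": 4.0},
]

def get_text(rows):
    # Select only the raw text for the vectorizer branch.
    return [row["text"] for row in rows]

def get_extra(rows):
    # Select the external feature as a 2-D column, passed through unchanged.
    return np.array([[row["length"]] for row in rows])

union = FeatureUnion([
    ("bow", Pipeline([
        ("select", FunctionTransformer(get_text, validate=False)),
        ("count", CountVectorizer()),
    ])),
    ("extra", FunctionTransformer(get_extra, validate=False)),
])

# Six word-count columns followed by the one extra column.
X = union.fit_transform(records)
print(X.shape)
```

Because the vectorizer stays inside the union, its settings remain
reachable for a parameter search under names such as bow__count__min_df.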

Suggestions are welcome!

- Joel

On 28 February 2014 03:10, Lars Buitinck <larsm...@gmail.com> wrote:
> 2014-02-27 8:33 GMT+01:00 michael kneier <michael.kne...@gmail.com>:
>> I would like to add a "combiner" class which would work with pipeline to 
>> allow users to augment the output of scikit's text feature extraction 
>> process (or other feature extraction processes). For example, after applying 
>> CountVectorizer, it is sometimes desirable to augment the resulting dataset 
>> with additional features. Unless I am missing something, this is not easily 
>> done if the count vectorization is being used in a pipeline, especially if 
>> CountVectorizer parameters such as min_df are being optimized along with 
>> downstream model parameters.
>
> CountVectorizer is very customizable. You can give a custom analyzer
> that extracts the features you want:
>
>     CountVectorizer(analyzer=features)
>
> where features is some custom function that gets either a filename or
> a file's content (as a string) and returns whatever features you want.
> The only downside is that all the features are going to be counted, so
> things like timestamps aren't going to be handled nicely.
>
> If that doesn't do the trick, have a look at DictVectorizer. That's
> even more flexible: you give it dicts mapping feature names to
> (numeric or string) values. It will build a matrix representation
> using booleans in place of string values, but it will leave the
> numeric values untouched.
>
> ------------------------------------------------------------------------------
> Flow-based real-time traffic analytics software. Cisco certified tool.
> Monitor traffic, SLAs, QoS, Medianet, WAAS etc. with NetFlow Analyzer
> Customize your own dashboards, set traffic alerts and generate reports.
> Network behavioral analysis & security monitoring. All-in-one tool.
> http://pubads.g.doubleclick.net/gampad/clk?id=126839071&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
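
For reference, the two suggestions quoted above might look roughly like
this. It is only a sketch: the features function, the HAS_DIGIT marker and
the sample data are invented for illustration and are not from the thread.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

def features(doc):
    # Hypothetical analyzer: whitespace tokens plus one hand-made feature.
    feats = doc.lower().split()
    if any(ch.isdigit() for ch in doc):
        feats.append("HAS_DIGIT")
    return feats

# A callable analyzer replaces the built-in tokenization entirely;
# every feature it returns is counted like a token.
cv = CountVectorizer(analyzer=features)
counts = cv.fit_transform(["call 911 now", "call me now"])

# DictVectorizer: string values become indicator columns named
# "key=value", while numeric values are copied into the matrix as-is.
samples = [
    {"source": "blog", "n_links": 3},
    {"source": "forum", "n_links": 0},
]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)
print(vec.feature_names_)
print(X)
```

The counting caveat applies to the first approach only: a value such as a
timestamp returned by the analyzer would be treated as one more token.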
