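I think you want a TfidfTransformer *after* the HashingVectorizer, in place
of the TfidfVectorizer: TfidfVectorizer expects raw text, not the sparse
matrix the hasher produces. Note that the documentation for the
HashingVectorizer says it does no IDF weighting of its own, so that has to
be a separate step: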

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
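
As a rough sketch of what I mean for the 'review_bow' branch (untested
against your data): the toy documents are made up, and n_features, norm=None
and alternate_sign=False are my own assumptions rather than anything from
your code. alternate_sign is the spelling in recent releases; older ones
used non_negative=True instead.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy review texts, invented purely for illustration.
docs = [
    "great food and friendly staff",
    "terrible service, cold food",
    "friendly staff but slow service",
]

review_bow = Pipeline([
    # Stateless step: hash raw text into a sparse term-count matrix.
    # norm=None keeps raw counts for the IDF step, and
    # alternate_sign=False keeps those counts non-negative.
    ('hasher', HashingVectorizer(n_features=2**10, norm=None,
                                 alternate_sign=False)),
    # Stateful step: learn IDF weights from the hashed counts,
    # then re-weight and L2-normalise each row.
    ('tfidf', TfidfTransformer()),
])

X = review_bow.fit_transform(docs)
print(X.shape)
```

In your pipeline the ItemSelector would stay in front of the hasher; only
the TfidfVectorizer step changes.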


On 7 May 2015 at 17:46, Adam Goodkind <a.goodk...@gmail.com> wrote:

> Hi,
>
> I'm having trouble integrating a HashingVectorizer into a pipeline using
> heterogeneous features. I've tried to construct my pipeline like this:
>
> pipeline = Pipeline([
>     # Extract review text and stars
>     ('text_stars', TextStarExtractor()),
>
>     # Use FeatureUnion to combine features of text and star ratings
>     ('union', FeatureUnion(
>         transformer_list=[
>
>             # Pipeline for pulling tf-idf from review text
>             ('review_bow', Pipeline([
>                 ('selector', ItemSelector(key='body')),
>                 ('hasher', HashingVectorizer()),
>                 ('tfidf', TfidfVectorizer()),
>             ])),
>
>             # Pipeline for pulling ad hoc features from post's body
>             ('body_stats', Pipeline([
>                 ('selector', ItemSelector(key='body')),
>                 ('stats', TextStats()),  # returns a list of dicts
>                 ('vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
>             ])),
>
>             ('star_stats', Pipeline([
>                 ('selector', ItemSelector(key='stars')),
>                 ('rating_stats', StarStats()),  # returns a list of dicts
>                 ('star_vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
>             ]))
>
>         ],
>
>         # weight components in FeatureUnion
>         transformer_weights={
>             'review_bow': 0.8,
>             'body_stats': 0.5,
>             'star_stats': 1.0,
>         },
>     )),
>
>     # Use an SGD classifier
>     ('clf', SGDClassifier()),
> ])
>
> parameters = {
>     'clf__alpha': (0.00001, 0.000001),
> }
>
> grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=3)
> grid_search.fit(data_dcts, training_targets)
>
> It works without the HashingVectorizer. What am I doing wrong? I can
> include the entire code if you'd like.
>
> - Adam
>
> --
> *Adam Goodkind *
> adamgoodkind.com <http://www.adamgoodkind.com>
> @adamgreatkind <https://twitter.com/#!/adamgreatkind>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>