Thanks Fred, that was the issue. I had to change TfidfVectorizer to
TfidfTransformer. I thought I couldn't use IDF either, but according to this
example
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py

it can be done.
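
For anyone who finds this thread later, here is a minimal sketch of the
text branch with that change applied. It is not my full pipeline: the toy
documents and the n_features value are made up for illustration, and
alternate_sign is the parameter name in recent scikit-learn releases
(older releases spelled the same idea non_negative=True), so adjust for
your version.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy documents, invented for this sketch.
docs = [
    "great food and great service",
    "terrible service",
    "the food was fine",
]

review_bow = Pipeline([
    # Stateless hashing of tokens into a fixed number of columns.
    # norm=None leaves raw term counts for the IDF step to reweight;
    # alternate_sign=False keeps those counts non-negative.
    ('hasher', HashingVectorizer(n_features=2**10, norm=None,
                                 alternate_sign=False)),
    # TfidfTransformer expects a count matrix (not raw text), so it can
    # follow the hasher. TfidfVectorizer expects raw text and cannot.
    ('tfidf', TfidfTransformer()),
])

X = review_bow.fit_transform(docs)
print(X.shape)
```

The point is that TfidfVectorizer tokenizes raw strings itself, so it
fails when handed the sparse matrix the hasher produces, while
TfidfTransformer only reweights an existing count matrix.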

- Adam

On Thu, May 7, 2015 at 8:53 PM, Fred Mailhot <fred.mail...@gmail.com> wrote:

> I think possibly you want the TfidfTransformer, *after* the
> HashingVectorizer...BUT...the documentation for the HashingVectorizer
> appears to discount the possibility of IDF-weighting:
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
>
>
> On 7 May 2015 at 17:46, Adam Goodkind <a.goodk...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm having trouble integrating a HashingVectorizer into a pipeline using
>> heterogeneous features. I've tried to construct my pipeline like this:
>>
>> pipeline = Pipeline([
>>     # Extract review text and stars
>>     ('text_stars', TextStarExtractor()),
>>
>>     # Use FeatureUnion to combine features of text and star ratings
>>     ('union', FeatureUnion(
>>         transformer_list=[
>>
>>             # Pipeline for pulling tf-idf from review text
>>             ('review_bow', Pipeline([
>>                 ('selector', ItemSelector(key='body')),
>>                 ('hasher', HashingVectorizer()),
>>                 ('tfidf', TfidfVectorizer()),
>>             ])),
>>
>>             # Pipeline for pulling ad hoc features from post's body
>>             ('body_stats', Pipeline([
>>                 ('selector', ItemSelector(key='body')),
>>                 ('stats', TextStats()),  # returns a list of dicts
>>                 ('vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
>>             ])),
>>
>>             ('star_stats', Pipeline([
>>                 ('selector', ItemSelector(key='stars')),
>>                 ('rating_stats', StarStats()),  # returns a list of dicts
>>                 ('star_vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
>>             ]))
>>
>>         ],
>>
>>         # weight components in FeatureUnion
>>         transformer_weights={
>>             'review_bow': 0.8,
>>             'body_stats': 0.5,
>>             'star_stats': 1.0,
>>         },
>>     )),
>>
>>     # Use an SGD classifier
>>     ('clf', SGDClassifier()),
>> ])
>>
>> parameters = {
>>     'clf__alpha': (0.00001, 0.000001),
>> }
>>
>> grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=3)
>> grid_search.fit(data_dcts, training_targets)
>>
>> It works without the HashingVectorizer. What am I doing wrong? I can
>> include the entire code if you'd like.
>>
>> - Adam
>>
>> --
>> *Adam Goodkind *
>> adamgoodkind.com <http://www.adamgoodkind.com>
>> @adamgreatkind <https://twitter.com/#!/adamgreatkind>
>>
>>
>> ------------------------------------------------------------------------------
>> One dashboard for servers and applications across Physical-Virtual-Cloud
>> Widest out-of-the-box monitoring support with 50+ applications
>> Performance metrics, stats and reports that give you Actionable Insights
>> Deep dive visibility with transaction tracing using APM Insight.
>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>


-- 
*Adam Goodkind *
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#!/adamgreatkind>