Thanks Fred, that was the issue. I had to change the TfidfVectorizer step
that follows the HashingVectorizer to a TfidfTransformer. I thought I
couldn't use IDF either, but according to this example
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
it can be done.
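
For anyone who finds this thread later, here is a minimal sketch of what the
'review_bow' sub-pipeline looks like after the change. It is not my full code:
the toy documents and the n_features value are made up for illustration, and
I'm assuming a recent scikit-learn where the HashingVectorizer option is
spelled alternate_sign=False (older releases used non_negative=True instead).

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy documents standing in for the review bodies (hypothetical data).
docs = [
    "great food and friendly service",
    "terrible food, slow service",
    "friendly staff but the food was cold",
]

review_bow = Pipeline([
    # Stateless step: hashes tokens into a fixed number of columns.
    # norm=None leaves the raw counts alone so normalization happens once,
    # in the TfidfTransformer. n_features is kept small only for this example.
    ('hasher', HashingVectorizer(n_features=2**10, norm=None,
                                 alternate_sign=False)),
    # Stateful step: learns IDF weights from the hashed count matrix and
    # reweights it. A TfidfVectorizer cannot sit here because it expects
    # raw text, not the sparse matrix the hasher produces.
    ('tfidf', TfidfTransformer()),
])

X = review_bow.fit_transform(docs)
print(X.shape)
```

The ItemSelector step from my original pipeline would still go in front of
'hasher'; I left it out here so the snippet runs on its own.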
- Adam
On Thu, May 7, 2015 at 8:53 PM, Fred Mailhot <fred.mail...@gmail.com> wrote:
> I think you possibly want the TfidfTransformer *after* the
> HashingVectorizer, in place of the TfidfVectorizer...BUT...the
> documentation for the HashingVectorizer appears to discount the
> possibility of IDF-weighting:
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
>
>
> On 7 May 2015 at 17:46, Adam Goodkind <a.goodk...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm having trouble integrating a HashingVectorizer into a pipeline using
>> heterogeneous features. I've tried to construct my pipeline like this:
>>
>> pipeline = Pipeline([
>> # Extract review text and stars
>> ('text_stars', TextStarExtractor()),
>>
>> # Use FeatureUnion to combine features of text and star ratings
>> ('union', FeatureUnion(
>> transformer_list=[
>>
>> # Pipeline for pulling tf-idf from review text
>> ('review_bow', Pipeline([
>> ('selector', ItemSelector(key='body')),
>> ('hasher', HashingVectorizer()),
>> ('tfidf', TfidfVectorizer()),
>> ])),
>>
>> # Pipeline for pulling ad hoc features from post's body
>> ('body_stats', Pipeline([
>> ('selector', ItemSelector(key='body')),
>> ('stats', TextStats()), # returns a list of dicts
>> ('vect', DictVectorizer(sparse=False)), # list of dicts -> feature matrix
>> ])),
>>
>> ('star_stats', Pipeline([
>> ('selector', ItemSelector(key='stars')),
>> ('rating_stats', StarStats()), # returns a list of dicts
>> ('star_vect', DictVectorizer(sparse=False)), # list of dicts -> feature matrix
>> ]))
>>
>> ],
>>
>> # weight components in FeatureUnion
>> transformer_weights={
>> 'review_bow': 0.8,
>> 'body_stats': 0.5,
>> 'star_stats': 1.0,
>> },
>> )),
>>
>> # Use an SGD classifier
>> ('clf', SGDClassifier()),
>> ])
>>
>> parameters = {
>> 'clf__alpha': (0.00001, 0.000001),
>> }
>>
>> grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=3)
>> grid_search.fit(data_dcts, training_targets)
>>
>> It works without the HashingVectorizer. What am I doing wrong? I can
>> include the entire code if you'd like.
>>
>> - Adam
>>
>> --
>> *Adam Goodkind *
>> adamgoodkind.com <http://www.adamgoodkind.com>
>> @adamgreatkind <https://twitter.com/#!/adamgreatkind>
>>
>>
>> ------------------------------------------------------------------------------
>> One dashboard for servers and applications across Physical-Virtual-Cloud
>> Widest out-of-the-box monitoring support with 50+ applications
>> Performance metrics, stats and reports that give you Actionable Insights
>> Deep dive visibility with transaction tracing using APM Insight.
>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
--
*Adam Goodkind *
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#!/adamgreatkind>