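I think you want a TfidfTransformer *after* the HashingVectorizer, in
place of the TfidfVectorizer. TfidfVectorizer expects raw text, but at
that point in the pipeline it receives the sparse matrix that the
HashingVectorizer has already produced. Note that the documentation for
the HashingVectorizer says the vectorizer itself does no IDF weighting
(that would make it stateful), so the weighting has to come from a
separate step:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html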
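
For what it's worth, here is a minimal sketch of an ordering that should
run: HashingVectorizer first, then a TfidfTransformer on its output. The
toy documents and n_features=2**10 are my own assumptions, not taken from
your code, and I've left out the ItemSelector step. alternate_sign=False
assumes a reasonably recent scikit-learn (older releases spelled this
non_negative=True).

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical toy documents, only here to exercise the two steps.
docs = [
    "great food and great service",
    "terrible food",
    "service was fine",
]

review_bow = Pipeline([
    # Stateless hashing of raw text into a sparse term-count matrix.
    # norm=None leaves the raw counts alone; alternate_sign=False keeps
    # them non-negative so they behave like ordinary term frequencies.
    ('hasher', HashingVectorizer(n_features=2**10,
                                 alternate_sign=False,
                                 norm=None)),
    # TfidfTransformer consumes a count matrix (not raw text) and
    # applies the IDF weighting learned during fit.
    ('tfidf', TfidfTransformer()),
])

X = review_bow.fit_transform(docs)
print(X.shape)
```

In the full pipeline this would stand in for the 'hasher' and 'tfidf'
steps of 'review_bow', with the ItemSelector kept in front of them.
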
On 7 May 2015 at 17:46, Adam Goodkind <a.goodk...@gmail.com> wrote:
> Hi,
>
> I'm having trouble integrating a HashingVectorizer into a pipeline using
> heterogeneous features. I've tried to construct my pipeline like this:
>
> pipeline = Pipeline([
>     # Extract review text and stars
>     ('text_stars', TextStarExtractor()),
>
>     # Use FeatureUnion to combine features of text and star ratings
>     ('union', FeatureUnion(
>         transformer_list=[
>
>             # Pipeline for pulling tf-idf from review text
>             ('review_bow', Pipeline([
>                 ('selector', ItemSelector(key='body')),
>                 ('hasher', HashingVectorizer()),
>                 ('tfidf', TfidfVectorizer()),
>             ])),
>
>             # Pipeline for pulling ad hoc features from post's body
>             ('body_stats', Pipeline([
>                 ('selector', ItemSelector(key='body')),
>                 ('stats', TextStats()),  # returns a list of dicts
>                 ('vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
>             ])),
>
>             ('star_stats', Pipeline([
>                 ('selector', ItemSelector(key='stars')),
>                 ('rating_stats', StarStats()),  # returns a list of dicts
>                 ('star_vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
>             ]))
>
>         ],
>
>         # weight components in FeatureUnion
>         transformer_weights={
>             'review_bow': 0.8,
>             'body_stats': 0.5,
>             'star_stats': 1.0,
>         },
>     )),
>
>     # Use an SGD classifier
>     ('clf', SGDClassifier()),
> ])
>
> parameters = {
>     'clf__alpha': (0.00001, 0.000001),
> }
>
> grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=3)
> grid_search.fit(data_dcts, training_targets)
>
> It works without the HashingVectorizer. What am I doing wrong? I can
> include the entire code if you'd like.
>
> - Adam
>
> --
> *Adam Goodkind *
> adamgoodkind.com <http://www.adamgoodkind.com>
> @adamgreatkind <https://twitter.com/#!/adamgreatkind>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>