Hi,

I'm having trouble integrating a HashingVectorizer into a pipeline using
heterogeneous features. I've tried to construct my pipeline like this:

pipeline = Pipeline([
    # Extract review text and stars
    ('text_stars', TextStarExtractor()),

    # Use FeatureUnion to combine features of text and star ratings
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling tf-idf from review text
            ('review_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('hasher', HashingVectorizer()),
                ('tfidf', TfidfVectorizer()),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
            ])),

            ('star_stats', Pipeline([
                ('selector', ItemSelector(key='stars')),
                ('rating_stats', StarStats()),  # returns a list of dicts
                ('star_vect', DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
            ]))

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'review_bow': 0.8,
            'body_stats': 0.5,
            'star_stats': 1.0,
        },
    )),

    # Use an SGD classifier
    ('clf', SGDClassifier()),
])

parameters = {
    'clf__alpha': (0.00001, 0.000001),
}

grid_search = GridSearchCV(pipeline, parameters, verbose=1, cv=3)
grid_search.fit(data_dcts, training_targets)

It works without the HashingVectorizer. What am I doing wrong? I can
include the entire code if you'd like.
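
For context, the custom ItemSelector step is not shown above. A minimal sketch of what it might look like, assuming it follows the usual pattern of picking one field out of a dict-like container (the sample data and its shape are hypothetical stand-ins for whatever TextStarExtractor returns):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single field from a dict-like container of columns.

    Hypothetical sketch: the real class may differ.
    """

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn.
        return self

    def transform(self, data_dict):
        # Assumes the previous step returns something indexable by field name.
        return data_dict[self.key]


# Toy data standing in for the output of TextStarExtractor (assumed shape).
sample = {'body': ['great food', 'slow service'], 'stars': [5, 2]}
bodies = ItemSelector(key='body').fit(sample).transform(sample)
```

With a selector like this, each sub-pipeline in the FeatureUnion receives only its own field ('body' or 'stars') before any vectorizer runs.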

- Adam

-- 
*Adam Goodkind *
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#!/adamgreatkind>