2013/2/25 Ark <[email protected]>:
> Because of the very large number of features (and to reduce the size), I use
> SelectKBest to select 150k of the 500k features that I get from
> TfidfVectorizer, which worked fine. When I use HashingVectorizer instead of
> TfidfVectorizer, I see the following warnings at runtime:
>
> Output:
> Extracting 150000 best features by a SelectKBest
> /home/n7/env/lib/python2.6/site-packages/scipy/stats/stats.py:3350:
> RuntimeWarning: invalid value encountered in divide
>   chisq = np.add.reduce((f_obs-f_exp)**2 / f_exp)
> /home/n7/env/lib/python2.6/site-packages/sklearn/feature_selection/univariate_selection.py:327:
> UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably
> duplicate features, or you used a classification score for a regression task.
>   warn("Duplicate scores. Result may depend on feature ordering."
>
> Although I do not see any significant change in accuracy, why would the warnings appear?

The "Duplicate scores" warning comes from features that have equal
frequencies and therefore equal chi2 scores. This is bound to occur
with large k in SelectKBest, and it's pretty harmless. If you really
want to get rid of them...

>  vectorizer = HashingVectorizer(n_features=450000,
>                                        stop_words='english',
>                                        ngram_range=(1, 2))
>  data_vectors = vectorizer.fit_transform(train_data)

... then you can try inserting a TfidfTransformer right here:

data_vectors = TfidfTransformer().fit_transform(data_vectors)

The idf factor in particular is likely to get rid of a lot of the duplicates.

>  ch2 = SelectKBest(chi2, k=select_features)
>  data_vectors = ch2.fit_transform(data_vectors, target)
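
Put together, the whole thing could look roughly like the sketch below.
It is only an illustration: the toy corpus, the labels, n_features=2**10
and k=20 are made up, and alternate_sign=False is the parameter that
recent scikit-learn releases use to keep hashed values non-negative
(chi2 needs non-negative input), so adjust it for the version you have.
The helper count_distinct_scores is hypothetical and just lets you
compare how many distinct chi2 scores you get with and without tf-idf.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, invented for illustration only.
train_data = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
    "investors sold shares in the market",
]
target = [0, 0, 0, 1, 1, 1]

# alternate_sign=False keeps the hashed values non-negative, as chi2 requires.
vectorizer = HashingVectorizer(n_features=2 ** 10,
                               stop_words="english",
                               ngram_range=(1, 2),
                               alternate_sign=False)
hashed = vectorizer.transform(train_data)

# Reweight the hashed term frequencies by inverse document frequency.
tfidf = TfidfTransformer().fit_transform(hashed)


def count_distinct_scores(X, y):
    """Hypothetical helper: number of distinct finite chi2 scores."""
    scores, _ = chi2(X, y)
    finite = scores[np.isfinite(scores)]
    return len(np.unique(np.round(finite, 12)))


print("distinct scores, hashed only:", count_distinct_scores(hashed, target))
print("distinct scores, with tf-idf:", count_distinct_scores(tfidf, target))

select_features = 20  # hypothetical k for the toy data
ch2 = SelectKBest(chi2, k=select_features)
data_vectors = ch2.fit_transform(tfidf, target)
print(data_vectors.shape)
```

On real data, comparing the two printed counts gives a quick check of
whether the idf reweighting actually reduces the ties for your corpus.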

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
