2013/2/25 Ark <[email protected]>:
> Due to a very large number of features(and reduce the size), I use SelectKBest
> which selects 150k features from the 500k features that I get from
> TfIdfVectorizer, which worked fine. When I use Hashing vectorizer instead of
> TfidfVectorizer I see following warnings runtime:
>
> Output:
> Extracting 150000 best features by a SelectKBest
> /home/n7/env/lib/python2.6/site-packages/scipy/stats/stats.py:3350:
> RuntimeWarning: invalid value encountered in divide
>   chisq = np.add.reduce((f_obs-f_exp)**2 / f_exp)
> /home/n7/env/lib/python2.6/site-packages/sklearn/feature_selection/univariate_selection.py:327:
> UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably
> duplicate features, or you used a classification score for a regression task.
>   warn("Duplicate scores. Result may depend on feature ordering."
>
> Although I do not see any significant changes in accuracy, why would the war?

The duplicate-score warning comes from equal scores, which are due to
equal frequencies of features. This is bound to occur with large k in
SelectKBest, and it's pretty harmless. If you really want to get rid
of them...

> vectorizer = HashingVectorizer(n_features=450000,
>                                stop_words='english',
>                                ngram_range=(1, 2))
> data_vectors = vectorizer.fit_transform(train_data)

... then you can try inserting a TfidfTransformer right here:

    data_vectors = TfidfTransformer().fit_transform(data_vectors)

The idf factor in particular is likely to get rid of a lot of the
duplicates.

> ch2 = SelectKBest(chi2, k=select_features)
> data_vectors = ch2.fit_transform(data_vectors, target)

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
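
P.S. To make the point about equal frequencies concrete, here is a small pure-Python sketch. It is not scikit-learn's code: the helper names (chi2_scores, idf_weighted) and the toy count matrix are made up for illustration, the idf formula is the smoothed variant ln((1 + n) / (1 + df)) + 1, and the l2 normalisation that the vectorizers normally apply is left out to keep the arithmetic readable. Two features with the same per-class totals get the same chi-squared score on raw counts; once each column is scaled by its idf, the scores differ because the two features occur in a different number of documents.

```python
import math


def chi2_scores(rows, labels):
    """Chi-squared score per feature column.

    Compares the per-class sum of each feature with the sum expected if
    the feature were independent of the class. Assumes every column has
    at least one non-zero entry (an all-zero column would divide by zero).
    """
    n_docs = len(rows)
    classes = sorted(set(labels))
    class_prob = {c: labels.count(c) / n_docs for c in classes}
    scores = []
    for j in range(len(rows[0])):
        total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, y in zip(rows, labels) if y == c)
            expected = class_prob[c] * total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def idf_weighted(rows):
    """Scale each column by a smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(rows)
    weighted = [list(row) for row in rows]
    for j in range(len(rows[0])):
        df = sum(1 for row in rows if row[j] > 0)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        for row in weighted:
            row[j] *= idf
    return weighted


# Toy term counts: 4 documents, 3 features, two classes.
# Feature 0 occurs twice in one class-0 document; feature 1 occurs once
# in each of two class-0 documents, so their per-class totals are equal.
counts = [
    [2, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 2],
]
labels = [0, 0, 1, 1]

raw_scores = chi2_scores(counts, labels)
tfidf_scores = chi2_scores(idf_weighted(counts), labels)

print("raw counts:  ", raw_scores)    # features 0 and 1 tie
print("idf-weighted:", tfidf_scores)  # the tie is broken
```

Features whose columns are exactly identical also share a document frequency, so they will still tie after idf weighting; that is why the extra step only removes a lot of the duplicates, not necessarily all of them.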
