> You could also try the HashingVectorizer in sklearn.feature_extraction
> and see if performance is still acceptable with a small number of
> features. That also skips storing the vocabulary, which I imagine will
> be quite large as well.
>

Because the number of features is very large (and to reduce the size of the 
data), I use SelectKBest to select 150k features out of the 500k I get from 
TfidfVectorizer, and that worked fine. When I use HashingVectorizer instead 
of TfidfVectorizer, I see the following warnings at runtime:

Output:
Extracting 150000 best features by a SelectKBest
/home/n7/env/lib/python2.6/site-packages/scipy/stats/stats.py:3350: 
RuntimeWarning: invalid value encountered in divide
  chisq = np.add.reduce((f_obs-f_exp)**2 / f_exp)
/home/n7/env/lib/python2.6/site-packages/sklearn/feature_selection/univariate_selection.py:327: 
UserWarning: Duplicate scores. Result may depend on feature ordering.There are 
probably duplicate features, or you used a classification score for a 
regression task.
  warn("Duplicate scores. Result may depend on feature ordering."

Although I do not see any significant change in accuracy, why would these 
warnings appear?


 from sklearn.feature_extraction.text import HashingVectorizer
 from sklearn.feature_selection import SelectKBest, chi2

 select_features = 150000  # number of best features to keep
 # hash unigrams and bigrams into a fixed 450k-dimensional space
 vectorizer = HashingVectorizer(n_features=450000,
                                stop_words='english',
                                ngram_range=(1, 2))
 data_vectors = vectorizer.fit_transform(train_data)
 # keep the select_features highest-scoring features by chi-squared
 ch2 = SelectKBest(chi2, k=select_features)
 data_vectors = ch2.fit_transform(data_vectors, target)
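
One guess I had while writing this up: chi2 assumes non-negative feature 
values (counts or frequencies), while HashingVectorizer by default emits 
signed values, so hashed columns can be negative or sum to zero, which might 
be what produces the invalid-divide (NaN) and duplicate-score warnings. A 
sketch of what I mean, using the non_negative flag (assuming the installed 
sklearn version supports it):

 # untested guess: fold the hash sign away so every feature value is >= 0
 # and chi2's expected-frequency computation never divides by zero
 vectorizer = HashingVectorizer(n_features=450000,
                                stop_words='english',
                                ngram_range=(1, 2),
                                non_negative=True)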
 