2013/2/21 Ark <[email protected]>:
> Document classification over ~3000 categories with ~12000 documents.
> The number of features comes out to be 500,000 [in which case the
> classifier dumped with joblib is 10 GB]. If I use SelectKBest to select
> the 200,000 best features, the size comes down to 4.8 GB while
> maintaining the accuracy at 97%. But I am not sure whether there is
> another alternative that would not sacrifice accuracy.
A back-of-the-envelope calculation shows that 3000 * 200000 * 8 bytes
(the size of a 64-bit float) is indeed around 4.8 GB.
You could cut that in half by converting coef_ and optionally
intercept_ to np.float32 (that's not officially supported, but with
the current implementation it should work):
clf.coef_ = clf.coef_.astype(np.float32)
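
For example, something along these lines (just a sketch; the filename is
a placeholder, and this assumes clf is a linear model with coef_ and
intercept_ attributes that you persist with joblib as before):

import numpy as np
from sklearn.externals import joblib

# downcast the learned weights to 32-bit floats; roughly halves their memory
# (not officially supported, as noted above)
clf.coef_ = clf.coef_.astype(np.float32)
clf.intercept_ = clf.intercept_.astype(np.float32)

# dump again and compare the file size with the original
joblib.dump(clf, 'clf_float32.joblib')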
You could also try the HashingVectorizer in
sklearn.feature_extraction.text and see if performance is still
acceptable with a smaller number of features. That also skips storing
the vocabulary, which I imagine will be quite large as well.
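
A rough sketch of what that could look like (n_features here is just a
value to experiment with, and raw_documents stands in for your own list
of text documents):

from sklearn.feature_extraction.text import HashingVectorizer

# no fitting needed, and no vocabulary is stored on the vectorizer
vec = HashingVectorizer(n_features=2 ** 16)
X = vec.transform(raw_documents)

With hashing, the width of coef_ is determined directly by n_features,
so you can trade accuracy for model size explicitly.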
(I hope you meant 12000 documents *per class*?)
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam