2013/2/21 Ark <[email protected]>:
> Document classification over ~3000 categories with ~12000 documents.
> The number of features comes out to be 500,000 [in which case the
> classifier dumped with joblib is 10 GB]. If I use SelectKBest to select
> the 200,000 best features, the size comes down to 4.8 GB while
> maintaining the accuracy at 97%. But I am not sure whether there is
> another alternative that would not sacrifice accuracy.
A back-of-the-envelope calculation shows that 3000 * 200000 * 8 bytes
(the size of a 64-bit float) is indeed around 4.8 GB.
You could cut that in half by converting coef_ and optionally
intercept_ to np.float32 (that's not officially supported, but with
the current implementation it should work):
clf.coef_ = clf.coef_.astype(np.float32)
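
For example, something along these lines (just a sketch; the filename is
a placeholder, and this assumes clf is a linear model with coef_ and
intercept_ attributes that you persist with joblib as before):

import numpy as np
from sklearn.externals import joblib

# downcast the learned weights to 32-bit floats; roughly halves their memory
# (not officially supported, as noted above)
clf.coef_ = clf.coef_.astype(np.float32)
clf.intercept_ = clf.intercept_.astype(np.float32)

# dump again and compare the file size with the original
joblib.dump(clf, 'clf_float32.joblib')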
You could also try the HashingVectorizer in
sklearn.feature_extraction.text and see if performance is still
acceptable with a smaller number of features. That also skips storing
the vocabulary, which I imagine will be quite large as well.
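
A rough sketch of what that could look like (n_features here is just a
value to experiment with, and raw_documents stands in for your own list
of text documents):

from sklearn.feature_extraction.text import HashingVectorizer

# no fitting needed, and no vocabulary is stored on the vectorizer
vec = HashingVectorizer(n_features=2 ** 16)
X = vec.transform(raw_documents)

With hashing, the width of coef_ is determined directly by n_features,
so you can trade accuracy for model size explicitly.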
(I hope you meant 12000 documents *per class*?)
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam