I just opened a PR for this issue: https://github.com/scikit-learn/scikit-learn/pull/1702
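
Until that is merged and released, here is the workaround Peter describes below, collected in one place. This is only a rough sketch (untested here), assuming ``clf`` is an already-fitted SGDClassifier and 'model.pkl' is just a placeholder filename::

    import pickle
    import scipy.sparse as sp

    # fraction of non-zero coefficients; with penalty='l1' or
    # penalty='elasticnet' this should be well below 1.0
    density = clf.coef_.nonzero()[0].shape[0] / float(clf.coef_.size)
    print("coef_ density: %.4f" % density)

    # store coef_ as a CSR matrix to shrink the pickled model
    clf.coef_ = sp.csr_matrix(clf.coef_)
    with open('model.pkl', 'wb') as f:
        pickle.dump(clf, f)

    # later, before predicting, convert back to a dense ndarray
    # (predicting with a sparse coef_ currently raises an error;
    # that is what the PR above fixes)
    with open('model.pkl', 'rb') as f:
        clf = pickle.load(f)
    clf.coef_ = clf.coef_.toarray()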
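
For completeness, the earlier memory-saving suggestions quoted below, again only as a sketch: the float32 cast is not officially supported, and ``documents``, ``y`` and ``n_features=2 ** 18`` are placeholder names/values standing in for your own data and tuning::

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # (a) bound the feature space and skip storing a vocabulary;
    #     n_features=2 ** 18 is just an example value to tune, and
    #     `documents` / `y` stand in for your training texts and labels
    vectorizer = HashingVectorizer(n_features=2 ** 18)
    X = vectorizer.transform(documents)
    clf = SGDClassifier(penalty='elasticnet').fit(X, y)

    # (b) halve the size of the dense coefficients by casting to float32
    #     (not officially supported, but works with the current code):
    #     ~11 GB of float64 for 500k x 3k shrinks to ~5.5 GB
    clf.coef_ = clf.coef_.astype(np.float32)
    clf.intercept_ = clf.intercept_.astype(np.float32)
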
2013/2/22 Peter Prettenhofer <[email protected]>:
> @ark: for 500K features and 3K classes your coef_ matrix will be:
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11 GB
>
> coef_ is stored as a dense matrix - you might get a considerably
> smaller matrix if you use sparse regularization (keeps most
> coefficients zero) and convert the coef_ array to a scipy sparse
> matrix prior to saving the object - this should cut your storage costs
> by a factor of 10-100.
>
> To check the sparsity (fraction of non-zero entries) of ``coef_`` use::
>
>     sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)
>
> To convert the coef_ array do::
>
>     clf = ...  # your fitted model
>     clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
>
> Prediction doesn't currently work (it raises an error) when coef_ is a
> sparse matrix rather than a numpy array - this is a bug in sklearn
> that should be fixed - I'll submit a PR for this.
> In the meantime please convert back to a numpy array or patch the
> SGDClassifier.decision_function method (adding ``dense_output=True``
> when calling ``safe_sparse_dot`` should do the trick).
>
> best,
> Peter
>
> PS: I strongly recommend using sparse regularization (penalty='l1'
> or penalty='elasticnet') - this should cut the fraction of non-zero
> coefficients significantly.
>
> 2013/2/22 Ark <[email protected]>:
>>>
>>> You could cut that in half by converting coef_ and optionally
>>> intercept_ to np.float32 (that's not officially supported, but with
>>> the current implementation it should work):
>>>
>>>     clf.coef_ = clf.coef_.astype(np.float32)
>>>
>>> You could also try the HashingVectorizer in sklearn.feature_extraction
>>> and see if performance is still acceptable with a small number of
>>> features. That also skips storing the vocabulary, which I imagine will
>>> be quite large as well.
>>>
>> HashingVectorizer might indeed save some space... will test whether the
>> results are still acceptable.
>>
>>> (I hope you meant 12000 documents *per class*?)
>>>
>> :( Unfortunately, no, I have 12000 documents in all... at least as a
>> starting point. Initially it is just to collect metrics, and as time
>> goes on, more documents per category will be added. Besides, I am also
>> limited on training time, which seems to go over an hour as the number
>> of samples goes up. [My very first attempt was with 200k documents.]
>> Thanks for the suggestions.
>
> --
> Peter Prettenhofer

--
Peter Prettenhofer
