@ark: for 500K features and 3K classes your coef_ matrix will take about:
500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
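A quick sanity check of that figure in Python, assuming 8-byte float64
coefficients::

    n_features, n_classes = 500000, 3000
    size_gb = n_features * n_classes * 8 / 1024. ** 3  # bytes -> GiB
    print("%.1f GB" % size_gb)  # ~11.2 GB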
Coef_ is stored as a dense matrix - you might get a considerably smaller
model if you use sparse regularization (which keeps most coefficients at
zero) and convert the coef_ array to a scipy sparse matrix prior to
saving the object - this should cut your storage costs by a factor of
10-100.

To check how sparse ``coef_`` is, look at the fraction of non-zero
entries::

    density = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)

To convert the coef_ array do::

    clf = ...  # your fitted model
    clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

Prediction doesn't work currently (it raises an error) when coef_ is a
sparse matrix rather than a numpy array - this is a bug in sklearn that
should be fixed - I'll submit a PR for this. In the meanwhile please
convert back to a numpy array before predicting, or patch the
SGDClassifier.decision_function method (adding ``dense_output=True`` to
the ``safe_sparse_dot`` call should do the trick).

best,
 Peter

PS: I strongly recommend using sparse regularization (penalty='l1' or
penalty='elasticnet') - this should make ``coef_`` much sparser and cut
your model size significantly.

2013/2/22 Ark <[email protected]>:
>>
>> You could cut that in half by converting coef_ and optionally
>> intercept_ to np.float32 (that's not officially supported, but with
>> the current implementation it should work):
>>
>>     clf.coef_ = clf.coef_.astype(np.float32)
>>
>> You could also try the HashingVectorizer in sklearn.feature_extraction
>> and see if performance is still acceptable with a small number of
>> features. That also skips storing the vocabulary, which I imagine will
>> be quite large as well.
>>
> HashingVectorizer might indeed save some space... will test whether the
> results are still acceptable...
>
>> (I hope you meant 12000 documents *per class*?)
>>
> :( Unfortunately, no, I have 12000 documents in all, at least as a
> starting point. Initially it is just to collect metrics, and as time
> goes on, more documents per category will be added. Besides, I am also
> limited on training time, which seems to go over an hour as the number
> of samples goes up. [My very first attempt was with 200k documents.]
> Thanks for the suggestions.

--
 Peter Prettenhofer
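A minimal sketch of the sparsify-then-restore round trip described in the
reply (``clf`` is assumed to be an already fitted SGDClassifier, ``X_test``
and the file name ``model.pkl`` are placeholders)::

    import pickle
    import scipy.sparse as sp

    # fraction of non-zero coefficients - the lower, the bigger the saving
    print(clf.coef_.nonzero()[0].shape[0] / float(clf.coef_.size))

    # store coef_ as CSR so the pickle only keeps the non-zero entries
    clf.coef_ = sp.csr_matrix(clf.coef_)
    with open('model.pkl', 'wb') as f:
        pickle.dump(clf, f)

    # later: load and convert back to a dense ndarray before predicting
    with open('model.pkl', 'rb') as f:
        clf = pickle.load(f)
    clf.coef_ = clf.coef_.toarray()
    pred = clf.predict(X_test)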
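And a sketch combining the HashingVectorizer suggestion with sparse
regularization (``train_docs``, ``test_docs``, ``y_train`` and the choice
of ``n_features=2 ** 18`` are illustrative placeholders, not values from
the thread)::

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # hashing: no vocabulary is stored, so nothing extra ends up in the pickle
    vec = HashingVectorizer(n_features=2 ** 18)
    X_train = vec.transform(train_docs)

    # l1 / elasticnet penalties keep most coefficients at zero
    clf = SGDClassifier(penalty='elasticnet')
    clf.fit(X_train, y_train)

    pred = clf.predict(vec.transform(test_docs))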
