I was just wondering: does the current l1 penalty implementation actually lead to a sparse coef_? I thought additional tricks were required for that. If it does, maybe an example would be nice?
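
For what it's worth, here is a quick sketch (not from the thread below) of how one could check this empirically; the toy dataset and the alpha/n_iter values are just placeholders::

    # Rough sketch: fit SGDClassifier with penalty='l1' on synthetic data
    # and report the fraction of non-zero coefficients. Dataset size,
    # alpha and n_iter are arbitrary choices for illustration only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, n_features=500,
                               n_informative=20, random_state=0)
    clf = SGDClassifier(loss="log", penalty="l1", alpha=1e-4,
                        n_iter=20, random_state=0).fit(X, y)

    # density of coef_: closer to 0 means more zeros were produced
    density = np.count_nonzero(clf.coef_) / float(clf.coef_.size)
    print("fraction of non-zero coefficients: %.3f" % density)
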
On 02/22/2013 11:15 AM, Peter Prettenhofer wrote:
> I just opened a PR for this issue:
> https://github.com/scikit-learn/scikit-learn/pull/1702
>
> 2013/2/22 Peter Prettenhofer <[email protected]>:
>> @ark: for 500K features and 3K classes your coef_ matrix will be:
>> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
>>
>> Coef_ is stored as a dense matrix - you might get a considerably
>> smaller matrix if you use sparse regularization (keeps most
>> coefficients zero) and convert the coef_ array to a scipy sparse
>> matrix prior to saving the object - this should cut your storage
>> costs by a factor of 10-100.
>>
>> To check the sparsity of ``coef_`` use::
>>
>>     sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)
>>
>> To convert the coef_ array do::
>>
>>     clf = ...  # your fitted model
>>     clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
>>
>> Prediction doesn't work currently (raises an error) when coef_ is a
>> sparse matrix rather than a numpy array - this is a bug in sklearn
>> that should be fixed - I'll submit a PR for this.
>> In the meanwhile please convert back to a numpy array or patch the
>> SGDClassifier.decision_function method (adding ``dense_output=True``
>> when calling ``safe_sparse_dot`` should do the trick).
>>
>> best,
>> Peter
>>
>> PS: I strongly recommend using sparse regularization (penalty='l1'
>> or penalty='elasticnet') - this should cut your memory footprint
>> significantly.
>>
>> 2013/2/22 Ark <[email protected]>:
>>>> You could cut that in half by converting coef_ and optionally
>>>> intercept_ to np.float32 (that's not officially supported, but with
>>>> the current implementation it should work):
>>>>
>>>>     clf.coef_ = clf.coef_.astype(np.float32)
>>>>
>>>> You could also try the HashingVectorizer in sklearn.feature_extraction
>>>> and see if performance is still acceptable with a small number of
>>>> features. That also skips storing the vocabulary, which I imagine will
>>>> be quite large as well.
>>>>
>>> HashingVectorizer might indeed save some space... will test whether
>>> the results are still acceptable...
>>>
>>>> (I hope you meant 12000 documents *per class*?)
>>>>
>>> :( Unfortunately, no, I have 12000 documents in all... at least as a
>>> starting point. Initially it is just to collect metrics, and as time
>>> goes on, more documents per category will be added. Besides, I am also
>>> limited on training time, which seems to go over an hour as the number
>>> of samples goes up. [My very first attempt was with 200k documents.]
>>> Thanks for the suggestions.
>>
>> --
>> Peter Prettenhofer
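
And in case it helps others finding this thread, a rough sketch of the save/restore cycle Peter describes above; ``clf``, ``X_test`` and the pickle path are placeholders, and converting coef_ back to a dense array is only needed until the decision_function fix is merged::

    # Sketch only: shrink a fitted (L1/elasticnet-regularized) model by
    # storing coef_ as a CSR matrix, then restore a dense coef_ before
    # predicting, since a sparse coef_ currently breaks decision_function.
    import pickle
    import scipy.sparse as sp

    clf.coef_ = sp.csr_matrix(clf.coef_)      # keep only non-zero entries
    with open("model.pkl", "wb") as f:
        pickle.dump(clf, f)

    # ... later, at prediction time:
    with open("model.pkl", "rb") as f:
        clf = pickle.load(f)
    clf.coef_ = clf.coef_.toarray()           # back to a dense ndarray
    predictions = clf.predict(X_test)
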
