I just opened a PR for this issue:
https://github.com/scikit-learn/scikit-learn/pull/1702

2013/2/22 Peter Prettenhofer <[email protected]>:
> @ark: for 500K features and 3K classes your coef_ matrix will take:
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11 GB
>
> Coef_ is stored as a dense matrix - you might get a considerably
> smaller matrix if you use sparse regularization (it keeps most
> coefficients zero) and convert the coef_ array to a scipy sparse
> matrix prior to saving the object - this should cut your storage costs
> by a factor of 10-100.
>
> To check the fraction of non-zero entries in ``coef_`` use::
>
>     sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)
>
> To convert the coef_ array do::
>
>     import scipy.sparse
>
>     clf = ...  # your fitted model
>     clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
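>
> You can then persist the fitted model as usual, e.g. (a minimal sketch,
> assuming joblib; the filename is just a placeholder)::
>
>     from sklearn.externals import joblib
>
>     # the sparse coef_ is pickled along with the rest of the estimator
>     joblib.dump(clf, 'clf_sparse.pkl')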
>
>
> Prediction doesn't currently work (it raises an error) when coef_ is a
> sparse matrix rather than a numpy array - this is a bug in sklearn
> that should be fixed - I'll submit a PR for this.
> In the meantime please convert back to a numpy array or patch the
> SGDClassifier.decision_function method (adding ``dense_output=True``
> when calling ``safe_sparse_dot`` should do the trick).
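>
> For instance, to convert back before predicting (a minimal sketch; X_test
> stands in for your own feature matrix)::
>
>     # coef_ was turned into a scipy sparse matrix above; make it dense again
>     clf.coef_ = clf.coef_.toarray()
>     predictions = clf.predict(X_test)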
>
> best,
>  Peter
>
> PS: I strongly recommend using sparse regularization (penalty='l1'
> or penalty='elasticnet') - this should increase the sparsity of coef_
> significantly and thus cut your storage costs.
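>
> For example (a rough sketch; X_train and y_train are placeholders for your
> own data, and alpha needs tuning)::
>
>     from sklearn.linear_model import SGDClassifier
>
>     clf = SGDClassifier(penalty='elasticnet', alpha=1e-5)
>     clf.fit(X_train, y_train)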
>
> 2013/2/22 Ark <[email protected]>:
>>>
>>> You could cut that in half by converting coef_ and optionally
>>> intercept_ to np.float32 (that's not officially supported, but with
>>> the current implementation it should work):
>>>
>>>     clf.coef_ = clf.coef_.astype(np.float32)
>>>
>>> You could also try the HashingVectorizer in
>>> sklearn.feature_extraction.text and see if performance is still acceptable
>>> with a small number of features. That also skips storing the vocabulary,
>>> which I imagine will be quite large as well.
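>>>
>>> For example (a rough sketch; raw_documents stands in for your corpus)::
>>>
>>>     from sklearn.feature_extraction.text import HashingVectorizer
>>>
>>>     vec = HashingVectorizer(n_features=2 ** 18)  # far fewer than 500K features
>>>     X = vec.transform(raw_documents)             # no vocabulary is stored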
>>>
>>  HashingVectorizer might indeed save some space... I will test whether the
>> results are still acceptable.
>>
>>> (I hope you meant 12000 documents *per class*?)
>>>
>>  :( Unfortunately, no, I have 12000 documents in all.. at least as a starting
>> point. Initially it is just to collect metrics, and as time goes on, more
>> documents per category will be added. Besides, I am also limited on training
>> time, which seems to go over an hour as the number of samples goes up.. [My
>> very first attempt was with 200k documents].
>> Thanks for the suggestions.
>>
>
>
>
> --
> Peter Prettenhofer



-- 
Peter Prettenhofer
