@ark: for 500K features and 3K classes your coef_ matrix will be:
500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11 GB

``coef_`` is stored as a dense matrix - you might get a considerably
smaller matrix if you use sparse regularization (it keeps most
coefficients at zero) and convert the coef_ array to a scipy sparse
matrix prior to saving the object - this should cut your storage costs
by a factor of 10-100.

To check the sparsity of ``coef_`` use::

sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)
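# note: this actually returns the fraction of *non-zero* coefficients,
# so a value close to 0.0 means a highly sparse model (assumes ``clf``
# is an already fitted estimator)
print("non-zero fraction: %.4f" % sparsity(clf))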

To convert the coef_ array do::

import scipy.sparse

clf = ...  # your fitted model
clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
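
Once coef_ is sparse you can persist the estimator as usual - a minimal
sketch using pickle (the file name is just an example)::

import pickle

# scipy sparse matrices pickle fine, so the whole estimator can be dumped
with open('sgd_model.pkl', 'wb') as f:
    pickle.dump(clf, f)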


Prediction doesn't currently work (it raises an error) when coef_ is a
sparse matrix rather than a numpy array - this is a bug in sklearn
that should be fixed - I'll submit a PR for this.
In the meantime please convert back to a numpy array before predicting,
or patch the SGDClassifier.decision_function method (adding
``dense_output=True`` when calling ``safe_sparse_dot`` should do the
trick).
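
For reference, a minimal sketch of the first workaround (``X_test`` here
just stands in for whatever data you want to predict on)::

# convert the sparse coef_ back to a dense numpy array before predicting
clf.coef_ = clf.coef_.toarray()
predictions = clf.predict(X_test)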

best,
 Peter

PS: I strongly recommend using sparse regularization (penalty='l1'
or penalty='elasticnet') - this should increase the sparsity of coef_
significantly and thus cut the memory footprint.
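
A minimal sketch (``X_train`` / ``y_train`` and the alpha value are just
placeholders - tune them for your data)::

from sklearn.linear_model import SGDClassifier

# elasticnet mixes L1 and L2 penalties; L1 drives coefficients to zero
clf = SGDClassifier(penalty='elasticnet', alpha=1e-5)
clf.fit(X_train, y_train)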

2013/2/22 Ark <[email protected]>:
>>
>> You could cut that in half by converting coef_ and optionally
>> intercept_ to np.float32 (that's not officially supported, but with
>> the current implementation it should work):
>>
>>     clf.coef_ = clf.coef_.astype(np.float32)
>>
>> You could also try the HashingVectorizer in sklearn.feature_extraction
>> and see if performance is still acceptable with a small number of
>> features. That also skips storing the vocabulary, which I imagine will
>> be quite large as well.
>>
>  HashingVectorizer might indeed save some space...will test whether the
> results are still acceptable...
>
>> (I hope you meant 12000 document *per class*?)
>>
>  :( Unfortunately, no, I have 12000 documents in all..at least as a starting
> point. Initially it is just to collect metrics, and as time goes on, more
> documents per category will be added. Besides, I am also limited on training
> time, which seems to go over an hour as the number of samples goes up..[My
> very first attempt was with 200k documents].
> Thanks for the suggestions.
>



-- 
Peter Prettenhofer

