I was just wondering: does the current l1 penalty implementation actually
lead to sparse coef_?
I thought additional tricks were required for that.
If that is the case, maybe an example would be nice?
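
For reference, a rough sketch of the kind of check I have in mind (toy
data; the penalty/alpha values are arbitrary)::

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # tiny synthetic problem, just to see whether coefficients end up
    # exactly at zero
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 300)
    y = (X[:, 0] - X[:, 1] > 0).astype(int)

    clf = SGDClassifier(penalty='l1', alpha=0.001).fit(X, y)
    print("fraction of zero coefficients: %.2f" % np.mean(clf.coef_ == 0))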


On 02/22/2013 11:15 AM, Peter Prettenhofer wrote:
> I just opened a PR for this issue:
> https://github.com/scikit-learn/scikit-learn/pull/1702
>
> 2013/2/22 Peter Prettenhofer <[email protected]>:
>> @ark: for 500K features and 3K classes your coef_ matrix will be:
>> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11 GB
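>> (You can check the actual footprint of a fitted model with something
>> like ``clf.coef_.nbytes / 1024. ** 3``, which gives the size in GB.)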
>>
>> ``coef_`` is stored as a dense matrix - you might get a considerably
>> smaller matrix if you use sparse regularization (which keeps most
>> coefficients at zero) and convert the coef_ array to a scipy sparse
>> matrix prior to saving the object - this should cut your storage costs
>> by a factor of 10-100.
>>
>> To check the sparsity of ``coef_`` (i.e. the fraction of non-zero
>> coefficients - the lower, the sparser the model) use::
>>
>>     sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)
>>
>> To convert the coef_ array do::
>>
>>     import scipy.sparse
>>
>>     clf = ...  # your fitted model
>>     clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
>>
>>
>> Prediction doesn't currently work (it raises an error) when coef_ is a
>> sparse matrix rather than a numpy array - this is a bug in sklearn
>> that should be fixed - I'll submit a PR for this.
>> In the meantime, please convert back to a numpy array or patch the
>> SGDClassifier.decision_function method (adding ``dense_output=True``
>> when calling ``safe_sparse_dot`` should do the trick).
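>>
>> A rough sketch of the round trip (untested; assumes you pickle the
>> model and load it again before predicting)::
>>
>>     import scipy.sparse
>>
>>     # before saving: store coef_ as a sparse matrix
>>     clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
>>
>>     # after loading, before calling predict/decision_function:
>>     clf.coef_ = clf.coef_.toarray()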
>>
>> best,
>>   Peter
>>
>> PS: I strongly recommend using sparse regularization (i.e.
>> penalty='l1' or penalty='elasticnet') - this should increase the
>> sparsity of coef_ and cut your storage costs significantly.
>>
>> 2013/2/22 Ark <[email protected]>:
>>>> You could cut that in half by converting coef_ and optionally
>>>> intercept_ to np.float32 (that's not officially supported, but with
>>>> the current implementation it should work):
>>>>
>>>>      clf.coef_ = clf.coef_.astype(np.float32)
>>>>
>>>> You could also try the HashingVectorizer in sklearn.feature_extraction
>>>> and see if performance is still acceptable with a small number of
>>>> features. That also skips storing the vocabulary, which I imagine will
>>>> be quite large as well.
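>>>>
>>>> Something along these lines might be a starting point (untested;
>>>> raw_documents / y_train stand for your texts and labels, and the
>>>> n_features value is just a guess to trade memory against accuracy)::
>>>>
>>>>     from sklearn.feature_extraction.text import HashingVectorizer
>>>>     from sklearn.linear_model import SGDClassifier
>>>>
>>>>     vec = HashingVectorizer(n_features=2 ** 18)  # no vocabulary is stored
>>>>     X_train = vec.transform(raw_documents)  # raw_documents: your texts
>>>>     clf = SGDClassifier(penalty='elasticnet').fit(X_train, y_train)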
>>>>
>>>   HashingVectorizer might indeed save some space... I will test whether the
>>> results are still acceptable.
>>>
>>>> (I hope you meant 12000 documents *per class*?)
>>>>
>>>   :( Unfortunately, no, I have 12000 documents in all - at least as a
>>> starting point. Initially it is just to collect metrics, and as time goes
>>> on, more documents per category will be added. Besides, I am also limited
>>> on training time, which seems to go over an hour as the number of samples
>>> goes up. [My very first attempt was with 200k documents.]
>>> Thanks for the suggestions.
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Peter Prettenhofer
>
>


