2011/12/5 Ian Goodfellow <[email protected]>:
> On Fri, Dec 2, 2011 at 3:36 AM, Olivier Grisel <[email protected]> wrote:
>> 2011/12/2 Ian Goodfellow <[email protected]>:
>>> On Fri, Oct 7, 2011 at 5:14 AM, Olivier Grisel <[email protected]> wrote:
>>>> 2011/10/7 Ian Goodfellow <[email protected]>:
>>>>> Thanks. Yes, it does appear that liblinear uses only a 64 bit dense
>>>>> format, so this memory usage is normal/caused by the implementation
>>>>> of liblinear.
>>>>>
>>>>> You may want to update the documentation hosted at this site:
>>>>> http://scikit-learn.sourceforge.net/modules/svm.html#
>>>>>
>>>>> It has a section on "avoiding data copy" which only says that the
>>>>> data should be C contiguous.
>>>>
>>>> Thanks for the report, this should be fixed:
>>>>
>>>> https://github.com/scikit-learn/scikit-learn/commit/bf68b538e8fe251303fc0f7469aad6e8bf56a1d0
>>>>
>>>>> It looks like there's a different implementation of libsvm that uses
>>>>> a dense format, so I'll look into using that.
>>>>
>>>> Yes, the libsvm in scikit-learn can use both dense and sparse inputs
>>>> (actually we embed both the original sparse implementation and a
>>>> dense fork of it).
>>>
>>> How can I access the dense version of libsvm via scikit-learn?
>>
>> Those are the classes at the sklearn.svm level as opposed to the
>> classes at the sklearn.svm.sparse level.
>
> This would seem to invalidate the explanation of the excessive memory
> consumption of the sklearn.svm classes earlier in this e-mail thread.
> If I've been using the dense version all along, why is the memory
> consumption so high?
>
> If I train using an 11 GB design matrix, it ends up getting killed on
> a machine with 64 GB of RAM. If the only issue were converting to 64
> bit, it ought to use on the order of 33 GB of RAM (11 to hold the
> original data and 22 to hold the converted data). Does the training
> algorithm itself construct very large data structures for intermediate
> results? Is there a way to verify that sklearn is using dense libsvm
> under the hood?
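To make the dense vs. sparse split concrete, here is a minimal sketch
(assuming the current 0.x API with the sklearn.svm.sparse sub-package;
the __module__ check is just generic Python introspection, not an
official scikit-learn API, and the random array is a stand-in for your
own design matrix):

    import numpy as np
    from sklearn import svm
    from sklearn.svm import sparse as svm_sparse

    # Dense libsvm wrapper: the classes at the sklearn.svm level.
    dense_clf = svm.SVC(kernel='linear')

    # Sparse libsvm wrapper: the classes at the sklearn.svm.sparse level,
    # intended for scipy.sparse inputs.
    sparse_clf = svm_sparse.SVC(kernel='linear')

    # Crude check of which wrapper an estimator comes from: the defining
    # sub-module should differ between the two levels.
    print(type(dense_clf).__module__)
    print(type(sparse_clf).__module__)

    # Stand-in for your design matrix (float32, as in your 32 bit setup).
    data = np.random.rand(1000, 100).astype(np.float32)

    # To avoid an extra copy inside fit() on the dense path, hand it a
    # C-contiguous float64 array up front.
    X = np.asarray(data, dtype=np.float64, order='C')
    print(X.flags['C_CONTIGUOUS'])  # True
    print(X.dtype)                  # float64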
If you use the kernelized svm.SVC (from libsvm) there is a kernel cache,
but its size is bounded and the default limit is much smaller (200 MB
IIRC). Also, I am not sure whether this cache is enabled when using a
linear kernel. There are also the support vectors themselves, but I
thought that libsvm would only manipulate their indices (positions in
the training set) without copying them, so I don't think this can
explain your issue.

If you use LinearSVC (from liblinear) there is no dense version. So if
your data is a dense float array, it will be copied into a sparse data
structure of doubles and integers. That should still fit in 64 GB,
though.

> The conversion to sparse matrices sounded like a fairly plausible
> explanation for the memory consumption I was seeing.

Maybe there is a memory leak. It's hard to say without a reproduction
case.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
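For concreteness, a rough sketch of the two points above (hedged:
cache_size is the SVC parameter bounding the kernel cache in MB; the
per-entry cost of liblinear's (int index, double value) pairs and the
16-byte alignment padding are assumptions made only for a back-of-envelope
estimate; estimate_liblinear_copy_gb is an illustrative helper, not part
of scikit-learn):

    from sklearn.svm import SVC, LinearSVC

    # The libsvm kernel cache is bounded explicitly; cache_size is in MB
    # (default around 200 MB, as mentioned above).
    clf = SVC(kernel='linear', cache_size=200)

    # liblinear-backed estimator discussed above; its input is converted
    # to a sparse structure of doubles and integer indices.
    lin_clf = LinearSVC()

    # Back-of-envelope peak memory for that conversion: the original dense
    # float32 matrix stays alive while the copy is built, and each stored
    # entry becomes roughly an (int index, double value) pair, i.e. 12
    # bytes, often padded to 16 by the compiler.
    def estimate_liblinear_copy_gb(n_samples, n_features, bytes_per_entry=16):
        original = n_samples * n_features * 4.0           # float32 input
        copy = n_samples * n_features * bytes_per_entry   # sparse copy
        return (original + copy) / 1024.0 ** 3

    # An 11 GB float32 matrix holds about 2.95e9 entries, e.g.:
    print(estimate_liblinear_copy_gb(2950000, 1000))  # ~55 GB with padding

If an estimate along these lines comes out well under 64 GB for your
actual shape, a leak or an extra intermediate copy becomes more likely,
and a small reproduction script would help pin it down.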
