Look at sklearn.multiclass
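
For example, a minimal sketch (on a toy dataset I made up here): wrapping the dense SVC estimator in OneVsRestClassifier gives you one-vs-rest multiclass behavior instead of libsvm's built-in one-vs-one scheme.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Small toy problem with three classes. A C-contiguous float64 array
# avoids an extra copy inside the libsvm bindings.
rng = np.random.RandomState(0)
X = np.ascontiguousarray(rng.randn(60, 5))
y = np.repeat([0, 1, 2], 20)

# OneVsRestClassifier trains one binary SVC per class.
clf = OneVsRestClassifier(SVC(kernel='linear'))
clf.fit(X, y)
print(len(clf.estimators_))  # one underlying estimator per class
```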

Alex

On Mon, Dec 5, 2011 at 10:37 PM, Ian Goodfellow
<[email protected]> wrote:
> On Mon, Dec 5, 2011 at 4:24 PM, Olivier Grisel <[email protected]> 
> wrote:
>> 2011/12/5 Ian Goodfellow <[email protected]>:
>>> On Fri, Dec 2, 2011 at 3:36 AM, Olivier Grisel <[email protected]> 
>>> wrote:
>>>> 2011/12/2 Ian Goodfellow <[email protected]>:
>>>>> On Fri, Oct 7, 2011 at 5:14 AM, Olivier Grisel <[email protected]> 
>>>>> wrote:
>>>>>> 2011/10/7 Ian Goodfellow <[email protected]>:
>>>>>>> Thanks. Yes, it does appear that liblinear uses only a 64 bit dense
>>>>>>> format, so this memory usage is normal and caused by the
>>>>>>> implementation of liblinear.
>>>>>>>
>>>>>>> You may want to update the documentation hosted at this site:
>>>>>>> http://scikit-learn.sourceforge.net/modules/svm.html#
>>>>>>>
>>>>>>> It has a section on "avoiding data copy" which only says that the data
>>>>>>> should be C contiguous.
>>>>>>
>>>>>> Thanks for the report, this should be fixed:
>>>>>>
>>>>>>  https://github.com/scikit-learn/scikit-learn/commit/bf68b538e8fe251303fc0f7469aad6e8bf56a1d0
>>>>>>
>>>>>>> It looks like there's a different implementation of libsvm that uses a 
>>>>>>> dense
>>>>>>> format so I'll look into using that.
>>>>>>
>>>>>> Yes the libsvm in scikit-learn can use both dense and sparse inputs
>>>>>> (actually we embed both the original sparse implementation and a dense
>>>>>> fork of it).
>>>>>
>>>>> How can I access the dense version of libsvm via scikit-learn?
>>>>
>>>> Those are the classes at the sklearn.svm level as opposed to the
>>>> classes at the sklearn.svm.sparse level.
>>>
>>> This would seem to invalidate the explanation of the excessive memory
>>> consumption of the sklearn.svm classes earlier in this e-mail thread.
>>> If I've been using the dense version all along, why is the memory
>>> consumption so high?
>>>
>>> If I train using an 11 GB design matrix, it ends up getting killed on
>>> a machine with 64 GB of RAM. If the only issue were converting to 64
>>> bit it ought to use on the order of 33 GB of RAM (11 to hold the
>>> original data and 22 to hold the converted data). Does the training
>>> algorithm itself construct very large data structures for intermediate
>>> results? Is there a way to verify that sklearn is using dense libsvm
>>> under the hood?
>>
>> If you use the kernelized svm.SVC (from libsvm) there is a kernel
>> cache, but its size is bounded and the default limit should be much
>> smaller (200MB IIRC). Also I am not sure whether this cache is enabled
>> when using a linear kernel. There are also the support vectors
>> themselves, but I thought that libsvm would only manipulate their
>> indices (position in the training set) without copying them, so I
>> don't think this can explain your issue.
>>
>> If you use LinearSVC (from liblinear) there is no dense version. So if
>> your data is dense floats it will be copied into a sparse data
>> structure of doubles and integers. That should still fit in 64GB
>> though.
>>
>>> The conversion to sparse matrices sounded like a
>>> fairly plausible explanation for the memory consumption I was seeing.
>>
>> Maybe there is a memory leak. It's hard to say without a reproduction case.
>
>
> OK, I was using LinearSVC, so I guess I am still not using the dense
> implementation.
>
> Is there a way to use one-against-rest rather than one-against-one
> classification with the SVC class? This page makes it sound like the
> primary difference between SVC and LinearSVC is that SVC uses
> one-against-one while LinearSVC uses one-against-rest:
> http://scikit-learn.sourceforge.net/dev/modules/svm.html
>
>
> By the way, I suggest someone update the documentation to specify what
> the consequences of using the different SVM classes are. Currently
> LinearSVC is recommended "for huge datasets", not "for huge sparse
> datasets". That is on this page:
> http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.svm.LinearSVC.html
>
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure
>> contains a definitive record of customers, application performance,
>> security threats, fraudulent activity, and more. Splunk takes this
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>

