On Mon, Dec 5, 2011 at 4:24 PM, Olivier Grisel <[email protected]> wrote:
> 2011/12/5 Ian Goodfellow <[email protected]>:
>> On Fri, Dec 2, 2011 at 3:36 AM, Olivier Grisel <[email protected]> 
>> wrote:
>>> 2011/12/2 Ian Goodfellow <[email protected]>:
>>>> On Fri, Oct 7, 2011 at 5:14 AM, Olivier Grisel <[email protected]> 
>>>> wrote:
>>>>> 2011/10/7 Ian Goodfellow <[email protected]>:
>>>>>> Thanks. Yes, it does appear that liblinear uses only a 64-bit dense
>>>>>> format, so this memory usage is normal and caused by the liblinear
>>>>>> implementation.
>>>>>>
>>>>>> You may want to update the documentation hosted at this site:
>>>>>> http://scikit-learn.sourceforge.net/modules/svm.html#
>>>>>>
>>>>>> It has a section on "avoiding data copy" which only says that the data
>>>>>> should be C contiguous.
>>>>>
>>>>> Thanks for the report, this should be fixed:
>>>>>
>>>>>  https://github.com/scikit-learn/scikit-learn/commit/bf68b538e8fe251303fc0f7469aad6e8bf56a1d0
>>>>>
>>>>>> It looks like there's a different implementation of libsvm that uses a 
>>>>>> dense
>>>>>> format so I'll look into using that.
>>>>>
>>>>> Yes the libsvm in scikit-learn can use both dense and sparse inputs
>>>>> (actually we embed both the original sparse implementation and a dense
>>>>> fork of it).
>>>>
>>>> How can I access the dense version of libsvm via scikit-learn?
>>>
>>> Those are the classes at the sklearn.svm level as opposed to the
>>> classes at the sklearn.svm.sparse level.
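For concreteness, a minimal sketch of that split, assuming the 0.x-era
layout where the sparse variants live under sklearn.svm.sparse:

    import numpy as np
    from scipy import sparse
    from sklearn.svm import SVC                       # dense libsvm fork
    from sklearn.svm.sparse import SVC as SparseSVC   # original sparse libsvm

    X = np.random.rand(100, 20)
    y = np.random.randint(0, 2, 100)

    SVC(kernel='linear').fit(X, y)                           # dense path
    SparseSVC(kernel='linear').fit(sparse.csr_matrix(X), y)  # sparse path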
>>
>> This would seem to invalidate the explanation of the excessive memory
>> consumption of the sklearn.svm classes earlier in this e-mail thread.
>> If I've been using the dense version all along, why is the memory
>> consumption so high?
>>
>> If I train using an 11 GB design matrix, it ends up getting killed on
>> a machine with 64 GB of RAM. If the only issue were converting to 64-bit,
>> it ought to use on the order of 33 GB of RAM (11 to hold the
>> original data and 22 to hold the converted data). Does the training
>> algorithm itself construct very large data structures for intermediate
>> results? Is there a way to verify that sklearn is using dense libsvm
>> under the hood?
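One way to verify (and control) the copying is to do the conversion
explicitly before fit(), so the only large allocation is one you make
yourself; a sketch, assuming the dense path wants 64-bit, C-contiguous
input as the documentation fix above states:

    import numpy as np

    # Explicit up-front conversion: a no-op if X is already float64
    # and C-contiguous, otherwise a single visible copy.
    X = np.ascontiguousarray(X, dtype=np.float64)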
>
> If you use kernel svm.SVC (from libsvm) there is a kernel cache, but
> its size is bounded and the default limit should be much smaller (200 MB
> IIRC). Also I am not sure whether this cache is enabled when using a
> linear kernel. There are also the support vectors themselves, but I
> thought that libsvm would only manipulate their indices (positions in
> the training set) without copying them, so I don't think this can
> explain your issue.
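For reference, that cache limit is exposed as a constructor parameter
(in megabytes), so it can be bounded explicitly rather than relying on
the default; a sketch:

    from sklearn.svm import SVC

    # cap the libsvm kernel cache at 200 MB
    clf = SVC(kernel='linear', cache_size=200)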
>
> If you use LinearSVC (from liblinear) there is no dense version. So if
> your data is stored as dense floats, it will be copied into a sparse
> data structure of doubles and integers. That should still fit in 64 GB,
> though.
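If the data genuinely is sparse, one way to sidestep that implicit copy
is to hand liblinear a scipy.sparse matrix directly; a sketch, assuming
the sparse variant lives at sklearn.svm.sparse in this version (note
that CSR is a poor fit for a truly dense 11 GB matrix, since it stores
an index per nonzero):

    from scipy import sparse
    from sklearn.svm.sparse import LinearSVC

    X_csr = sparse.csr_matrix(X)   # one explicit conversion, under your control
    clf = LinearSVC().fit(X_csr, y)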
>
>> The conversion to sparse matrices sounded like a
>> fairly plausible explanation for the memory consumption I was seeing.
>
> Maybe there is a memory leak. It's hard to say without a reproduction case.


OK, I was using LinearSVC, so I guess I am still not using the dense
implementation.

Is there a way to use one-against-rest rather than one-against-one
classification with the SVC class? This page makes it sound like the
primary difference between SVC and LinearSVC is that SVC uses
one-against-one while LinearSVC uses one-against-rest:
http://scikit-learn.sourceforge.net/dev/modules/svm.html
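In the meantime, one-vs-rest behaviour can be had by wrapping SVC
yourself; a sketch, assuming sklearn.multiclass is available in your
version:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # fits one binary SVC per class instead of one per pair of classes
    clf = OneVsRestClassifier(SVC(kernel='linear'))
    clf.fit(X, y)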


By the way, I suggest someone update the documentation to specify what
the consequences of using the different SVM classes are. Currently
LinearSVC is recommended "for huge datasets", not "for huge sparse
datasets". That is on this page:
http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.svm.LinearSVC.html

>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
