2011/12/5 Ian Goodfellow <[email protected]>:
> On Fri, Dec 2, 2011 at 3:36 AM, Olivier Grisel <[email protected]> wrote:
>> 2011/12/2 Ian Goodfellow <[email protected]>:
>>> On Fri, Oct 7, 2011 at 5:14 AM, Olivier Grisel <[email protected]> wrote:
>>>> 2011/10/7 Ian Goodfellow <[email protected]>:
>>>>> Thanks. Yes, it does appear that liblinear uses only a 64 bit dense
>>>>> format, so this memory usage is normal/caused by the implementation
>>>>> of liblinear.
>>>>>
>>>>> You may want to update the documentation hosted at this site:
>>>>> http://scikit-learn.sourceforge.net/modules/svm.html#
>>>>>
>>>>> It has a section on "avoiding data copy" which only says that the
>>>>> data should be C contiguous.
>>>>
>>>> Thanks for the report, this should be fixed:
>>>>
>>>> https://github.com/scikit-learn/scikit-learn/commit/bf68b538e8fe251303fc0f7469aad6e8bf56a1d0
>>>>
>>>>> It looks like there's a different implementation of libsvm that uses
>>>>> a dense format, so I'll look into using that.
>>>>
>>>> Yes, the libsvm in scikit-learn can use both dense and sparse inputs
>>>> (actually we embed both the original sparse implementation and a
>>>> dense fork of it).
>>>
>>> How can I access the dense version of libsvm via scikit-learn?
>>
>> Those are the classes at the sklearn.svm level as opposed to the
>> classes at the sklearn.svm.sparse level.
>
> This would seem to invalidate the explanation of the excessive memory
> consumption of the sklearn.svm classes earlier in this e-mail thread.
> If I've been using the dense version all along, why is the memory
> consumption so high?
>
> If I train using an 11 GB design matrix, it ends up getting killed on
> a machine with 64 GB of RAM. If the only issue were converting to 64
> bit, it ought to use on the order of 33 GB of RAM (11 to hold the
> original data and 22 to hold the converted data). Does the training
> algorithm itself construct very large data structures for intermediate
> results? Is there a way to verify that sklearn is using dense libsvm
> under the hood?
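To make the dense vs. sparse split concrete, here is a minimal sketch
(assuming the current 0.x API with the sklearn.svm.sparse sub-package;
the __module__ check is just generic Python introspection, not an
official scikit-learn API, and the random array is a stand-in for your
own design matrix):

    import numpy as np
    from sklearn import svm
    from sklearn.svm import sparse as svm_sparse

    # Dense libsvm wrapper: the classes at the sklearn.svm level.
    dense_clf = svm.SVC(kernel='linear')

    # Sparse libsvm wrapper: the classes at the sklearn.svm.sparse level,
    # intended for scipy.sparse inputs.
    sparse_clf = svm_sparse.SVC(kernel='linear')

    # Crude check of which wrapper an estimator comes from: the defining
    # sub-module should differ between the two levels.
    print(type(dense_clf).__module__)
    print(type(sparse_clf).__module__)

    # Stand-in for your design matrix (float32, as in your 32 bit setup).
    data = np.random.rand(1000, 100).astype(np.float32)

    # To avoid an extra copy inside fit() on the dense path, hand it a
    # C-contiguous float64 array up front.
    X = np.asarray(data, dtype=np.float64, order='C')
    print(X.flags['C_CONTIGUOUS'])  # True
    print(X.dtype)                  # float64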
If you use the kernelized svm.SVC (from libsvm) there is a kernel cache,
but its size is bounded and the default limit is much smaller (200 MB
IIRC). Also, I am not sure whether this cache is enabled when using a
linear kernel. There are also the support vectors themselves, but I
thought that libsvm would only manipulate their indices (positions in
the training set) without copying them, so I don't think this can
explain your issue.

If you use LinearSVC (from liblinear) there is no dense version. So if
your data is a dense float array, it will be copied into a sparse data
structure of doubles and integers. That should still fit in 64 GB,
though.

> The conversion to sparse matrices sounded like a fairly plausible
> explanation for the memory consumption I was seeing.

Maybe there is a memory leak. It's hard to say without a reproduction
case.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
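For concreteness, a rough sketch of the two points above (hedged:
cache_size is the SVC parameter bounding the kernel cache in MB; the
per-entry cost of liblinear's (int index, double value) pairs and the
16-byte alignment padding are assumptions made only for a back-of-envelope
estimate; estimate_liblinear_copy_gb is an illustrative helper, not part
of scikit-learn):

    from sklearn.svm import SVC, LinearSVC

    # The libsvm kernel cache is bounded explicitly; cache_size is in MB
    # (default around 200 MB, as mentioned above).
    clf = SVC(kernel='linear', cache_size=200)

    # liblinear-backed estimator discussed above; its input is converted
    # to a sparse structure of doubles and integer indices.
    lin_clf = LinearSVC()

    # Back-of-envelope peak memory for that conversion: the original dense
    # float32 matrix stays alive while the copy is built, and each stored
    # entry becomes roughly an (int index, double value) pair, i.e. 12
    # bytes, often padded to 16 by the compiler.
    def estimate_liblinear_copy_gb(n_samples, n_features, bytes_per_entry=16):
        original = n_samples * n_features * 4.0           # float32 input
        copy = n_samples * n_features * bytes_per_entry   # sparse copy
        return (original + copy) / 1024.0 ** 3

    # An 11 GB float32 matrix holds about 2.95e9 entries, e.g.:
    print(estimate_liblinear_copy_gb(2950000, 1000))  # ~55 GB with padding

If an estimate along these lines comes out well under 64 GB for your
actual shape, a leak or an extra intermediate copy becomes more likely,
and a small reproduction script would help pin it down.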
