On Fri, Dec 2, 2011 at 3:36 AM, Olivier Grisel <[email protected]> wrote:
> 2011/12/2 Ian Goodfellow <[email protected]>:
>> On Fri, Oct 7, 2011 at 5:14 AM, Olivier Grisel <[email protected]>
>> wrote:
>>> 2011/10/7 Ian Goodfellow <[email protected]>:
>>>> Thanks. Yes it does appear that liblinear uses only a 64 bit dense format,
>>>> so this memory usage is normal/caused by the implementation of liblinear.
>>>>
>>>> You may want to update the documentation hosted at this site:
>>>> http://scikit-learn.sourceforge.net/modules/svm.html#
>>>>
>>>> It has a section on "avoiding data copy" which only says that the data
>>>> should be C contiguous.
>>>
>>> Thanks for the report, this should be fixed:
>>>
>>> https://github.com/scikit-learn/scikit-learn/commit/bf68b538e8fe251303fc0f7469aad6e8bf56a1d0
>>>
>>>> It looks like there's a different implementation of libsvm that uses a
>>>> dense format so I'll look into using that.
>>>
>>> Yes the libsvm in scikit-learn can use both dense and sparse inputs
>>> (actually we embed both the original sparse implementation and a dense
>>> fork of it).
>>
>> How can I access the dense version of libsvm via scikit-learn?
>
> Those are the classes at the sklearn.svm level as opposed to the
> classes at the sklearn.svm.sparse level.

This would seem to invalidate the explanation given earlier in this e-mail
thread for the excessive memory consumption of the sklearn.svm classes. If
I've been using the dense version all along, why is the memory consumption
so high?

If I train using an 11 GB design matrix, the process ends up getting killed
on a machine with 64 GB of RAM. If the only issue were the conversion to
64-bit, it ought to use on the order of 33 GB of RAM (11 GB to hold the
original data and 22 GB to hold the converted copy).

Does the training algorithm itself construct very large data structures for
intermediate results?

Is there a way to verify that sklearn is using dense libsvm under the hood?
The conversion to sparse matrices sounded like a fairly plausible
explanation for the memory consumption I was seeing.

>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
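
For reference, the 33 GB estimate above can be checked on the NumPy side with a small sketch. It assumes the design matrix is a 32-bit float array (implied by the 11 GB to 22 GB doubling, not stated outright), and the helper name extra_bytes_for_conversion is hypothetical. It only measures what a dtype/contiguity conversion would cost; it does not inspect what sklearn or libsvm allocate internally during training.

```python
import numpy as np


def extra_bytes_for_conversion(X, target_dtype=np.float64):
    """Bytes needed for a C-contiguous copy of X in target_dtype.

    Hypothetical helper: returns 0 when X already has the target dtype
    and is C-contiguous, i.e. when no conversion copy would be required.
    """
    target = np.dtype(target_dtype)
    if X.dtype == target and X.flags["C_CONTIGUOUS"]:
        return 0
    return X.size * target.itemsize


# Small stand-in for the real design matrix (assumed float32).
X32 = np.zeros((1000, 50), dtype=np.float32)

extra = extra_bytes_for_conversion(X32)
ratio = extra / X32.nbytes  # size of the converted copy relative to the original

# Converting up front means the array already matches the expected layout.
X64 = np.ascontiguousarray(X32, dtype=np.float64)
no_copy_needed = extra_bytes_for_conversion(X64) == 0

# Asking again for the same dtype and layout reuses the existing buffer.
same_buffer = np.shares_memory(X64, np.ascontiguousarray(X64, dtype=np.float64))

print(X32.nbytes, extra, ratio, no_copy_needed, same_buffer)
```

With a float32 input the converted copy is twice the size of the original, which matches the 11 GB + 22 GB figure. Anything beyond that would have to come from somewhere other than this conversion.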
