look at sklearn.multiclass
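For example, an untested sketch that wraps the (dense, libsvm-backed) SVC in an
explicit one-vs-rest scheme; the toy data is just for illustration:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # toy dense problem: 100 samples, 20 features, 3 classes
    X = np.random.randn(100, 20)
    y = np.random.randint(0, 3, size=100)

    # OneVsRestClassifier trains one binary SVC per class instead of
    # relying on libsvm's built-in one-vs-one scheme
    clf = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)
    print(clf.predict(X[:5]))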
Alex

On Mon, Dec 5, 2011 at 10:37 PM, Ian Goodfellow <[email protected]> wrote:
> On Mon, Dec 5, 2011 at 4:24 PM, Olivier Grisel <[email protected]> wrote:
>> 2011/12/5 Ian Goodfellow <[email protected]>:
>>> On Fri, Dec 2, 2011 at 3:36 AM, Olivier Grisel <[email protected]> wrote:
>>>> 2011/12/2 Ian Goodfellow <[email protected]>:
>>>>> On Fri, Oct 7, 2011 at 5:14 AM, Olivier Grisel <[email protected]> wrote:
>>>>>> 2011/10/7 Ian Goodfellow <[email protected]>:
>>>>>>> Thanks. Yes, it does appear that liblinear uses only a 64-bit dense
>>>>>>> format, so this memory usage is normal and caused by the
>>>>>>> implementation of liblinear.
>>>>>>>
>>>>>>> You may want to update the documentation hosted at this site:
>>>>>>> http://scikit-learn.sourceforge.net/modules/svm.html#
>>>>>>>
>>>>>>> It has a section on "avoiding data copy" which only says that the
>>>>>>> data should be C contiguous.
>>>>>>
>>>>>> Thanks for the report, this should be fixed:
>>>>>>
>>>>>> https://github.com/scikit-learn/scikit-learn/commit/bf68b538e8fe251303fc0f7469aad6e8bf56a1d0
>>>>>>
>>>>>>> It looks like there's a different implementation of libsvm that
>>>>>>> uses a dense format, so I'll look into using that.
>>>>>>
>>>>>> Yes, the libsvm in scikit-learn can use both dense and sparse inputs
>>>>>> (actually we embed both the original sparse implementation and a
>>>>>> dense fork of it).
>>>>>
>>>>> How can I access the dense version of libsvm via scikit-learn?
>>>>
>>>> Those are the classes at the sklearn.svm level, as opposed to the
>>>> classes at the sklearn.svm.sparse level.
>>>
>>> This would seem to invalidate the explanation of the excessive memory
>>> consumption of the sklearn.svm classes earlier in this e-mail thread.
>>> If I've been using the dense version all along, why is the memory
>>> consumption so high?
>>>
>>> If I train using an 11 GB design matrix, it ends up getting killed on
>>> a machine with 64 GB of RAM. If the only issue were converting to 64
>>> bit, it ought to use on the order of 33 GB of RAM (11 to hold the
>>> original data and 22 to hold the converted data). Does the training
>>> algorithm itself construct very large data structures for
>>> intermediate results? Is there a way to verify that sklearn is using
>>> dense libsvm under the hood?
>>
>> If you use the kernelized svm.SVC (from libsvm) there is a kernel
>> cache, but its size is bounded and the default limit should be much
>> smaller (200 MB IIRC). Also, I am not sure whether this cache is
>> enabled when using a linear kernel. There are also the support
>> vectors themselves, but I thought that libsvm would only manipulate
>> their indices (positions in the training set) without copying them,
>> so I don't think this can explain your issue.
>>
>> If you use LinearSVC (from liblinear) there is no dense version, so
>> if your data is a dense float array it will be copied into a sparse
>> data structure of doubles and integers. That should still fit in
>> 64 GB, though.
>>
>>> The conversion to sparse matrices sounded like a fairly plausible
>>> explanation for the memory consumption I was seeing.
>>
>> Maybe there is a memory leak. It's hard to say without a reproduction
>> case.
>
> OK, I was using LinearSVC, so I guess I am still not using the dense
> implementation.
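If you do switch to the dense sklearn.svm.SVC path, one way to rule out a
silent extra copy is to hand it the layout libsvm expects (C-contiguous
float64) up front, so any conversion happens once and under your control. A
minimal, untested sketch, assuming NumPy and toy stand-in data:

    import numpy as np
    from sklearn.svm import SVC

    X32 = np.random.randn(100, 20).astype(np.float32)  # stand-in data
    y = np.random.randint(0, 2, size=100)

    # np.asarray is a no-op if X32 already has the right dtype and
    # order; otherwise it makes exactly one converted copy here
    X64 = np.asarray(X32, dtype=np.float64, order='C')
    print(X64.dtype)                     # float64
    print(X64.flags['C_CONTIGUOUS'])     # True

    SVC(kernel='linear').fit(X64, y)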
> Is there a way to use one-against-rest rather than one-against-one
> classification with the SVC class? This page makes it sound like the
> primary difference between SVC and LinearSVC is that SVC uses
> one-against-one while LinearSVC uses one-against-rest:
> http://scikit-learn.sourceforge.net/dev/modules/svm.html
>
> By the way, I suggest someone update the documentation to specify what
> the consequences of using the different SVM classes are. Currently
> LinearSVC is recommended "for huge datasets", not "for huge sparse
> datasets". That is on this page:
> http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.svm.LinearSVC.html
>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
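For reference, a back-of-the-envelope version of the 11 GB vs. 64 GB
arithmetic above, assuming the original matrix is float32 and taking
liblinear's per-nonzero cost as a {double value, int index} node padded to
16 bytes (that padding figure is an assumption, not something the thread
confirms):

    GB = 1024 ** 3
    n_values = 11 * GB // 4            # ~3e9 float32 entries in 11 GB
    converted = n_values * 16 / GB     # ~44 GB as liblinear nodes
    peak = 11 + converted              # ~55 GB with both copies live
    print(peak)                        # already close to the 64 GB ceiling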
