2011/10/7 Ian Goodfellow <[email protected]>:
> I understand that LinearSVC is implemented using liblinear, which I thought
> should work well with large datasets. However, when I pass LinearSVC.fit a
> design matrix of size 40,000 x 14,400 (in float32 format, so 2.3 gigabytes)
> it ends up using at least 8 additional gigabytes of RAM!

I am pretty sure that liblinear uses a sparse format based on 64-bit floats
and integer indices internally. So if your data is dense (very few zeros),
that means:

  64 / 8 * 2 * 40000 * 14400 = 9.2GB

which looks in line with what you report.

If you have many zeros in your data, it is possible that using numpy arrays
prevents the scikit-learn liblinear mapper from leveraging them to save some
space. Using scipy.sparse.csr_matrix and the sklearn.svm.sparse.LinearSVC
variant might help you save some RAM by not storing the zeros explicitly.

> I know that the numpy array passed to scikits needs to be C contiguous to
> avoid it being copied internally. I've checked and mine is, so that's not
> the issue.
> Is it normal for LinearSVC.fit to use so much memory? And if so, is this due
> to some intrinsic requirement of the algorithm, the implementation of
> liblinear, or the implementation of LinearSVC?
> I'm using scikits.learn version 0.4 installed using apt-get in ubuntu 11.04,
> if that's relevant.

You should definitely use the latest version of scikit-learn: 0.9 was
released a couple of weeks ago. The master from github is also quite stable.
Read the release notes to follow the API changes:

http://scikit-learn.sourceforge.net/dev/whats_new.html

Upgrading would not fix your issue, though, since the internal format of
liblinear has not changed between those versions. Many other bugs might have
been solved, though.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

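The memory estimate in the reply above can be sanity-checked with a few lines of plain Python. This is only a back-of-the-envelope sketch. The helper names and the `density` parameter are hypothetical, introduced here for illustration. The figure of 16 bytes per stored entry is the assumption behind the estimate (a 64-bit value plus an index slot of the same size); it has not been measured against liblinear itself.

```python
def dense_array_bytes(n_samples, n_features, itemsize=4):
    """Size of the dense input array; itemsize=4 corresponds to float32."""
    return n_samples * n_features * itemsize


def liblinear_copy_bytes(n_samples, n_features, density=1.0, bytes_per_entry=16):
    """Rough size of an internal (index, value) copy of the data.

    density is the fraction of entries actually stored (1.0 means every
    entry, zeros included). bytes_per_entry=16 is an assumption taken
    from the estimate 64 / 8 * 2 in the message, not a measured value.
    """
    n_stored = int(n_samples * n_features * density)
    return n_stored * bytes_per_entry


GB = 1e9  # decimal gigabytes, as used in the thread
n_samples, n_features = 40000, 14400

input_gb = dense_array_bytes(n_samples, n_features) / GB
dense_copy_gb = liblinear_copy_bytes(n_samples, n_features) / GB
# Hypothetical case: only 10% of the entries are nonzero and stored.
sparse_copy_gb = liblinear_copy_bytes(n_samples, n_features, density=0.1) / GB

print("float32 input array:      %.1f GB" % input_gb)
print("internal copy, dense:     %.1f GB" % dense_copy_gb)
print("internal copy, 10%% dense: %.1f GB" % sparse_copy_gb)
```

Under these assumptions the internal copy is about four times the size of the float32 input when every entry is stored, and it shrinks in proportion to the fraction of nonzeros when the zeros are skipped.
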
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
