2011/10/7 Ian Goodfellow <[email protected]>:
> I understand that LinearSVC is implemented using liblinear, which I thought
> should work well with large datasets. However, when I pass LinearSVC.fit a
> design matrix of size 40,000 x 14,400 (in float32 format, so 2.3 gigabytes)
> it ends up using at least 8 additional gigabytes of RAM!

I am pretty sure that liblinear uses a sparse format based on 64bit
floats and integer indices internally. So if your data is dense (very
few zeros) that means:

64 / 8 * 2 * 40000 * 14400 bytes = 9.2 GB, which looks in line with what
you report.
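
A rough sketch of that estimate in Python. The 16 bytes per entry (a
64-bit float plus an integer index, padded to 8 bytes) is an assumption
taken from the formula above, and estimate_bytes is a hypothetical helper,
not part of scikit-learn or liblinear:

```python
# Back-of-the-envelope memory estimate for a fully dense design matrix.
# Assumption: every entry is stored, and each stored entry costs
# bytes_per_entry bytes. The real layout may differ per platform.

def estimate_bytes(n_samples, n_features, bytes_per_entry):
    """Return the number of bytes needed to store every entry."""
    return n_samples * n_features * bytes_per_entry

n_samples, n_features = 40000, 14400

# The original float32 numpy array: 4 bytes per entry.
dense_float32 = estimate_bytes(n_samples, n_features, 4)

# The assumed internal copy: 64 / 8 * 2 = 16 bytes per entry.
liblinear_copy = estimate_bytes(n_samples, n_features, 64 // 8 * 2)

print("float32 input:  %.2f GB" % (dense_float32 / 1e9))
print("internal copy:  %.2f GB" % (liblinear_copy / 1e9))
```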

If you have many zeros in your data, it is possible that using numpy
arrays prevents the scikit-learn liblinear wrapper from taking
advantage of them to save space. Using scipy.sparse.csr_matrix and
the sklearn.svm.sparse.LinearSVC variant might help you save some RAM by
avoiding storing the zeros explicitly.
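
To see how much a CSR representation can save, here is a minimal sketch
on a small synthetic matrix. The shape and the roughly 10% density are
made up for illustration, and the sketch only measures the storage of the
matrix itself; it does not call LinearSVC:

```python
import numpy as np
import scipy.sparse as sp

# Synthetic data (illustrative only): about 90% of the entries are zero.
rng = np.random.RandomState(0)
X_dense = rng.rand(1000, 200).astype(np.float32)
X_dense[X_dense < 0.9] = 0.0

# CSR keeps only the nonzero values plus their column indices and one
# row-pointer array, so the zeros take no space.
X_csr = sp.csr_matrix(X_dense)

dense_bytes = X_dense.nbytes
csr_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes

print("dense: %d bytes" % dense_bytes)
print("CSR:   %d bytes (%d stored values)" % (csr_bytes, X_csr.nnz))
```

The saving shrinks as the data gets denser: once most entries are nonzero,
the per-entry index overhead makes CSR larger than the dense array.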

> I know that the numpy array passed to scikits needs to be C contiguous to
> avoid it being copied internally. I've checked and mine is, so that's not
> the issue.
> Is it normal for LinearSVC.fit to use so much memory? And if so, is this due
> to some intrinsic requirement of the algorithm, the implementation of
> liblinear, or the implementation of LinearSVC?
> I'm using scikits.learn version 0.4 installed using apt-get in ubuntu 11.04,
> if that's relevant.

You should definitely use the latest version of scikit-learn: 0.9 was
released a couple of weeks ago. The master branch on GitHub is also quite
stable. Read the release notes to follow the API changes:

  http://scikit-learn.sourceforge.net/dev/whats_new.html

Upgrading would not fix your issue, though, since the internal format of
liblinear has not changed between those versions. Many other bugs may
have been fixed in the meantime.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
