On 25 April 2012 06:38, Lars Buitinck <[email protected]> wrote:

> On 24 April 2012 21:13, Rafael Calsaverini
> <[email protected]> wrote:
> > I'm getting a memory error trying to do KernelPCA on a data set of
> > 30,000 texts. RandomizedPCA works all right. I think what's happening is
> > that RandomizedPCA works with sparse arrays and KernelPCA doesn't.
> >
> > Does anyone have a list of learning methods that are currently
> > implemented with sparse array support in scikit-learn?
>
> Methods' docstrings should state whether they support sparse matrix
> input. E.g., RandomizedPCA.fit's docstring states
>
>    X: array-like or scipy.sparse matrix, shape (n_samples, n_features)
>
> Following Scipy terminology, we don't consider scipy.sparse matrices
> array-like.
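
Since the docstrings are the place to look, the check can be scripted. The
following is only a minimal sketch: ToyEstimator and
docstring_mentions_sparse are hypothetical names made up for illustration,
not part of scikit-learn, and the test is a plain substring search on the
docstring.

```python
import inspect


class ToyEstimator:
    """Hypothetical stand-in for an estimator; not part of scikit-learn."""

    def fit(self, X):
        """Fit the model.

        X: array-like or scipy.sparse matrix, shape (n_samples, n_features)
        """
        return self

    def transform(self, X):
        """Transform the data.

        X: array-like, shape (n_samples, n_features)
        """
        return X


def docstring_mentions_sparse(func):
    """Return True if the callable's docstring contains the word 'sparse'."""
    doc = inspect.getdoc(func) or ""
    return "sparse" in doc.lower()


fit_ok = docstring_mentions_sparse(ToyEstimator.fit)
transform_ok = docstring_mentions_sparse(ToyEstimator.transform)
print("fit mentions sparse:", fit_ok)
print("transform mentions sparse:", transform_ok)
```

A check like this only reports what the docstring advertises. It would miss
a method such as KernelPCA, which accepts sparse input without saying so.
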
>
> There's no separate list of estimators/functions that support sparse
> input. You can look at
> examples/document_{classification_20newsgroups,clustering} for some of
> the algorithms that work well on text.
>
> As for KernelPCA, it in fact does support sparse matrices, even though
> this is not advertised; however, it may not handle large n_samples
> elegantly since it computes an n_samples × n_samples Gram matrix. On
> my laptop, KernelPCA exhibits the following behavior:
>
> * 1000×173419 sparse matrix: no problem.
> * 18846×173419: keeps running indefinitely with very high memory
> consumption.
> * 1000×173419 dense array: MemoryError.
>
> Prefer RandomizedPCA :)
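
A back-of-the-envelope calculation makes the sizes above concrete. This is
a rough sketch under two assumptions that are not stated in the thread: the
Gram matrix is stored densely, and every entry is an 8-byte float64. The
helper name dense_bytes is made up for illustration.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes for one dense n_rows x n_cols array (assumes 8-byte floats)."""
    return n_rows * n_cols * itemsize


GIB = 1024 ** 3  # bytes per gibibyte

# Gram matrices are n_samples x n_samples, whatever n_features is.
gram_1000 = dense_bytes(1000, 1000)
gram_18846 = dense_bytes(18846, 18846)
gram_30000 = dense_bytes(30000, 30000)

# The input itself, if the 1000 x 173419 matrix is held as a dense array.
dense_input = dense_bytes(1000, 173419)

for label, size in [
    ("Gram matrix, 1000 samples", gram_1000),
    ("Gram matrix, 18846 samples", gram_18846),
    ("Gram matrix, 30000 samples", gram_30000),
    ("dense 1000 x 173419 input", dense_input),
]:
    print(f"{label}: {size / GIB:.2f} GiB")
```

The figures count a single copy of each array. Any temporary copies made
while centering the kernel or solving the eigenproblem come on top, so
actual usage is likely higher.
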
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


On my list of things to do is a stress testing suite, attempting to find
the upper/lower/sparse limits of each method, to generate a table for the
documentation. Unfortunately I don't have the time right now, but I thought
I would throw that out there as something coming "in the future".
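
To give an idea of the shape such a test could take, here is a minimal
sketch, not an existing suite. It grows n_samples geometrically until a fit
raises MemoryError and reports the last size that worked. The names
largest_size_that_fits and fake_fit are hypothetical, and fake_fit merely
simulates an estimator that runs out of memory above 10,000 samples.

```python
def largest_size_that_fits(fit, start=1000, factor=2, limit=1_000_000):
    """Return the largest n_samples for which fit(n_samples) succeeded.

    Sizes tried are start, start * factor, ... up to limit.
    Returns None if even the first size raises MemoryError.
    """
    best = None
    n = start
    while n <= limit:
        try:
            fit(n)
        except MemoryError:
            break
        best = n
        n *= factor
    return best


def fake_fit(n_samples):
    """Hypothetical stand-in: pretends memory runs out above 10,000 samples."""
    if n_samples > 10_000:
        raise MemoryError


result = largest_size_that_fits(fake_fit)
print("largest size that fit:", result)
```

This only catches outright MemoryErrors. A run that keeps going
indefinitely, like the 18846-sample KernelPCA case quoted above, would also
need a time limit around each fit.
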

- Robert


-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)