On 24 April 2012 at 21:13, Rafael Calsaverini
<[email protected]> wrote:
> I'm getting a memory error trying to do KernelPCA on a data set of
> 30.000 texts. RandomizedPCA works alright. I think what's happening is
> that RandomizedPCA works with sparse arrays and KernelPCA doesn't.
>
> Does anyone have a list of learning methods that are currently
> implemented with sparse array support in scikits-learn?
Methods' docstrings should state whether they support sparse matrix
input. E.g., RandomizedPCA.fit's docstring states
X: array-like or scipy.sparse matrix, shape (n_samples, n_features)
Following Scipy terminology, we don't consider scipy.sparse matrices array-like.
There's no separate list of estimators/functions that support sparse
input. You can look at
examples/document_{classification_20newsgroups,clustering} for some of
the algorithms that work well on text.
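Since there is no separate list, one rough way to check is to search a
method's docstring for the word "sparse". The snippet below is only a
sketch of that idea: mentions_sparse is a hypothetical helper, and fit
and fit_dense_only are stand-in functions (not scikit-learn code) whose
docstrings imitate the convention quoted above. In practice you would
pass a real method such as RandomizedPCA.fit instead.

```python
import inspect


def mentions_sparse(func):
    """Return True if func's docstring mentions sparse input."""
    doc = inspect.getdoc(func) or ""  # getdoc returns None without a docstring
    return "sparse" in doc.lower()


# Hypothetical stand-ins for estimator methods; the first docstring copies
# the parameter description quoted from RandomizedPCA.fit.
def fit(X, y=None):
    """Fit the model.

    X: array-like or scipy.sparse matrix, shape (n_samples, n_features)
    """


def fit_dense_only(X):
    """Fit the model.

    X: array-like, shape (n_samples, n_features)
    """


print(mentions_sparse(fit))
print(mentions_sparse(fit_dense_only))
```

This is a heuristic only: as the KernelPCA case shows, an estimator may
accept sparse input without saying so in its docstring.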
As for KernelPCA, it in fact does support sparse matrices, even though
this is not advertised; however, it may not handle large n_samples
elegantly since it computes an n_samples × n_samples Gram matrix. On
my laptop, KernelPCA exhibits the following behavior:
* 1000×173419 sparse matrix: no problem.
* 18846×173419: keeps running indefinitely with very high memory consumption.
* 1000×173419 dense array: MemoryError.
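A back-of-the-envelope estimate is consistent with these observations.
The sketch below assumes the Gram matrix and the dense input are stored
as 8-byte floats (float64) and ignores any temporary copies made during
the computation; dense_bytes and gib are hypothetical helper names.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# The Gram matrix is n_samples x n_samples, regardless of n_features.
for n_samples in (1000, 18846, 30000):
    gram = dense_bytes(n_samples, n_samples)
    print(f"{n_samples} samples: Gram matrix ~{gib(gram):.2f} GiB")

# Densifying the 1000 x 173419 input is a separate cost.
dense_input = dense_bytes(1000, 173419)
print(f"dense 1000x173419 input: ~{gib(dense_input):.2f} GiB")
```

Under these assumptions the Gram matrix alone takes roughly 2.6 GiB for
18846 samples and about 6.7 GiB for 30000 samples, while the dense
1000×173419 array needs about 1.3 GiB before any kernel is computed.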
Prefer RandomizedPCA :)
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general