On 3 April 2012 at 00:51, David Warde-Farley
<[email protected]> wrote:
> You might try representing it as a sparse bag-of-words, i.e. a sparse matrix
> of 100,000 x (several million), where each row contains a 1 in positions
> where a feature is present and 0 otherwise. Such a representation should be
> fairly efficient in CSR or CSC.
Good idea. It's easier if you go through an intermediate DOK matrix:
>>> from scipy.sparse import dok_matrix
>>> x1 = [20, 1, 10]
>>> x2 = [ 1, 20, 10]
>>> X = dok_matrix((2, 100)) # replace 100 with the maximum pixel index
>>> for i in x1:
... X[0, i] = 1
...
>>> for i in x2:
... X[1, i] = 1
...
>>> X = X.tocsr()
> I'm not sure which clustering estimators in scikit-learn support sparse
> inputs but there should be a couple.
KMeans accepts sparse matrices; so do the metrics.pairwise functions,
so any clustering algorithm that accepts a square distance matrix
should be fine as well.
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> euclidean_distances(X, X)
array([[ 0.,  0.],
       [ 0.,  0.]])
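For completeness, here is a minimal sketch (not from the original thread) of feeding such a sparse matrix straight to KMeans; the toy feature indices and n_clusters=2 are illustrative assumptions:

```python
# Sketch: clustering a sparse bag-of-words matrix with KMeans.
# Feature indices and n_clusters=2 are illustrative assumptions.
from scipy.sparse import dok_matrix
from sklearn.cluster import KMeans

rows = [[20, 1, 10], [1, 20, 10], [5, 6, 7], [5, 6, 8]]
X = dok_matrix((len(rows), 100))   # 100 = max feature index + 1
for i, feats in enumerate(rows):
    for j in feats:
        X[i, j] = 1

# KMeans accepts the CSR matrix directly; no densification needed.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X.tocsr())
print(labels)
```

The first two rows carry the same feature set, so they land in the same cluster, away from the last two.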
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general