On 3 April 2012 00:51, David Warde-Farley
<[email protected]> wrote:
> You might try representing it as a sparse bag-of-words, i.e. a sparse matrix
> of 100,000 x (several million), where each row contains a 1 in positions
> where a feature is present and 0 otherwise. Such a representation should be
> fairly efficient in CSR or CSC.

Good idea. It's easier to build if you go through an intermediate DOK
matrix, which supports cheap element-wise assignment, and then convert
to CSR:

    >>> from scipy.sparse import dok_matrix
    >>> x1 = [20, 1, 10]
    >>> x2 = [ 1, 20, 10]
    >>> # 100 is the number of columns: use the maximum pixel index + 1
    >>> X = dok_matrix((2, 100))
    >>> for i in x1:
    ...     X[0, i] = 1
    ...
    >>> for i in x2:
    ...     X[1, i] = 1
    ...
    >>> X = X.tocsr()
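
If the per-element loop gets slow with 100,000 rows, you can probably skip
the DOK step and fill in the three CSR arrays directly. This is only a
sketch: the helper name indices_to_csr and its arguments are made up for
this example, and I'm assuming each sample is a list of integer feature
indices, as in x1 and x2 above.

```python
import numpy as np
from scipy.sparse import csr_matrix


def indices_to_csr(samples, n_features):
    """Build a binary CSR matrix from per-sample lists of feature indices.

    Hypothetical helper, not part of SciPy or scikit-learn.
    """
    # Deduplicate and sort each sample so that repeated indices do not
    # end up as duplicate entries (which CSR would sum to values > 1).
    samples = [sorted(set(s)) for s in samples]
    # indptr[i]:indptr[i + 1] is the slice of `indices` belonging to row i.
    indptr = np.cumsum([0] + [len(s) for s in samples])
    indices = np.fromiter((i for s in samples for i in s), dtype=np.int64)
    data = np.ones(len(indices), dtype=np.float64)
    return csr_matrix((data, indices, indptr),
                      shape=(len(samples), n_features))


X = indices_to_csr([[20, 1, 10], [1, 20, 10]], n_features=100)
```

The result should be the same matrix as the DOK version, just built from
three flat arrays instead of one assignment per nonzero.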


> I'm not sure which clustering estimators in scikit-learn support sparse
> inputs but there should be a couple.

KMeans accepts sparse matrices; so do the metrics.pairwise functions,
so any clustering algorithm that accepts a square distance matrix
should be fine as well.

    >>> from sklearn.metrics.pairwise import euclidean_distances
    >>> euclidean_distances(X, X)
    array([[ 0.,  0.],
           [ 0.,  0.]])
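
Note that x1 and x2 above contain the same three indices, so both rows are
identical and every distance is zero. Here is a rough sketch with rows that
actually differ, feeding the distance matrix to a clusterer that takes
precomputed distances. I picked DBSCAN with metric="precomputed" as one
such estimator; the toy rows and the eps and min_samples values are
invented for illustration and would need tuning on real data.

```python
from scipy.sparse import dok_matrix
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

# Toy data (made up): two pairs of samples that share most features.
rows = [[1, 10, 20], [1, 10, 21], [50, 60, 70], [50, 60, 71]]

X = dok_matrix((len(rows), 100))
for r, active in enumerate(rows):
    for i in active:
        X[r, i] = 1
X = X.tocsr()

# Dense (n_samples, n_samples) distance matrix computed from sparse input.
D = euclidean_distances(X, X)

# eps and min_samples are illustrative guesses, not recommended values.
labels = DBSCAN(eps=1.5, min_samples=2, metric="precomputed").fit_predict(D)
```

Keep in mind that the distance matrix itself is dense, so with 100,000
samples it would hold 10^10 entries; for data that size, an estimator that
takes the sparse X directly (such as KMeans) is likely the more practical
route.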


-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
