On Mon, Apr 02, 2012 at 08:19:49PM +0100, Martin Fergie wrote:
> Hi,
> 
> I need to cluster some integer data where the features are an unordered
> set, that is the two features
> [20, 1, 10] and
> [ 1, 20, 10] are equivalent and should be in the same cluster.
> 
> I think this is essentially similar to association rule data mining. Does
> anyone know how this can be achieved using sklearn? If not, can someone
> recommend a suitable Python package for clustering this type of data?
> 
> The data refer to image pixel locations, so each feature will be an
> integer ranging from 0 to potentially a few million. I'm likely to cluster
> problems of size 100,000 samples by 200 features (i.e. 200 pixel locations
> in each set).

You might try representing it as a sparse bag-of-words, i.e. a sparse matrix
of shape 100,000 x (several million), where each row contains a 1 at every
pixel location present in that sample and 0 elsewhere. Only about 200 entries
per row are nonzero (roughly 20 million in total), so such a representation
should be fairly efficient in CSR or CSC format.
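
A minimal sketch of that encoding, assuming scipy is available. The helper
name sets_to_csr, the toy samples and the 32-column width are made up for
illustration; the real matrix would have one column per possible pixel
location.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sets_to_csr(samples, n_locations):
    """Encode each sample (an iterable of integer pixel locations) as a binary CSR row."""
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits the entries of row i
    indices = []   # column index (pixel location) of every stored 1
    for sample in samples:
        # sorted(set(...)) drops duplicates and discards the original ordering
        cols = sorted(set(sample))
        indices.extend(cols)
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.float64)
    return csr_matrix((data, indices, indptr), shape=(len(samples), n_locations))

# Toy data: the first two samples are the same set in a different order
samples = [[20, 1, 10], [1, 20, 10], [3, 7, 20]]
X = sets_to_csr(samples, n_locations=32)

# Equal sets produce rows with identical stored column indices
same = np.array_equal(X[0].indices, X[1].indices)
```

Because each row records only which locations are present, two samples
containing the same locations in a different order map to the same row.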

I'm not sure offhand which clustering estimators in scikit-learn support
sparse inputs, but there should be a couple; KMeans and MiniBatchKMeans are
worth checking first.
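
As a rough sketch of what that could look like, the snippet below fits
sklearn.cluster.KMeans directly on a small CSR matrix. The toy rows, the
64-column width and the choice of two clusters are assumptions for
illustration only, not a recommendation for the real data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy data: rows 0/1 and rows 2/3 hold the same sets in different orders
rows = [[1, 10, 20], [20, 1, 10], [40, 41, 42], [42, 40, 41]]
row_idx = np.repeat(np.arange(len(rows)), [len(r) for r in rows])
col_idx = np.concatenate(rows)
X = csr_matrix((np.ones(len(col_idx)), (row_idx, col_idx)),
               shape=(len(rows), 64))

# KMeans accepts scipy sparse input; n_clusters=2 is an arbitrary toy choice
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Note that the fitted cluster_centers_ array is dense, with shape
(n_clusters, n_features), so with several million columns the number of
clusters has a direct memory cost. MiniBatchKMeans takes the same kind of
input and may be the more practical variant at 100,000 samples.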

David

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
