On Mon, Apr 02, 2012 at 08:19:49PM +0100, Martin Fergie wrote:
> Hi,
>
> I need to cluster some integer data where the features are an unordered
> set, that is the two features
> [20, 1, 10] and
> [ 1, 20, 10] are equivalent and should be in the same cluster.
>
> I think this is essentially similar to association rule data mining. Does
> anyone know how this can be achieved using sklearn? If not, can someone
> recommend me a suitable python package for clustering this type of data?
>
> The data refer to image pixels locations, so each feature will be an
> integer ranging from 0 to potentially a few million. I'm likely to cluster
> problems of size 100,000 samples by 200 features (i.e. 200 pixel locations
> in each set).

You might try representing it as a sparse bag-of-words, i.e. a sparse
matrix of 100,000 x (several million), where each row contains a 1 in
the positions where a feature is present and 0 otherwise. Such a
representation should be fairly efficient in CSR or CSC. I'm not sure
which clustering estimators in scikit-learn support sparse inputs, but
there should be a couple.

David

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
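
A minimal sketch of the sparse bag-of-words idea from the reply, for
illustration only. The helper name sets_to_csr, the toy samples, and the
n_positions value are made up for this example. MiniBatchKMeans is used
only as one scikit-learn estimator that accepts sparse CSR input; the
thread does not name a specific estimator, and Euclidean k-means on binary
rows is an assumption that may or may not suit the pixel data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans


def sets_to_csr(samples, n_positions):
    """Encode each sample (an iterable of integer positions) as a binary row.

    Hypothetical helper: row i gets a 1 in every column listed in samples[i].
    """
    indptr = [0]
    indices = []
    for sample in samples:
        # set() drops duplicates and ordering, so [20, 1, 10] and
        # [1, 20, 10] produce exactly the same row.
        unique = sorted(set(sample))
        indices.extend(unique)
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.float64)
    return csr_matrix(
        (data, np.array(indices, dtype=np.int64), np.array(indptr, dtype=np.int64)),
        shape=(len(samples), n_positions),
    )


# Toy data, invented for this sketch (real data: ~100,000 sets of ~200 pixels).
samples = [
    [20, 1, 10],
    [1, 20, 10],
    [20, 1, 11],
    [500, 600, 700],
    [700, 500, 601],
]
X = sets_to_csr(samples, n_positions=1000)

# One possible clustering estimator that accepts sparse input.
km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
labels = km.fit_predict(X)
```

At the sizes mentioned in the question, the matrix would hold about
100,000 x 200 = 20 million stored entries, which CSR handles without
materialising the zeros. Note, however, that k-means keeps its cluster
centres as a dense array of shape (n_clusters, n_features), so with
several million columns each centre is itself several million floats.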
