Re: [Scikit-learn-general] MiniBatchKMeans 10 classes: n_clusters= not k= Olivier +1

Olivier Grisel Mon, 10 Sep 2012 04:46:59 -0700

2012/9/10 denis <[email protected]>:
> Olivier, +1,
>    I had k= instead of n_clusters= --
> drew a warning  but not the same :(


Thanks for the bug report.

> Fwiw,
> for seed in range(5):
>      mbkm = MiniBatchKMeans( 10, random_state=seed, verbose=1
> ).fit(digits.data)
>
> -->
> seed 0: clusters [294 205 194 188 185 184 178 165 117  87]
> seed 1: clusters [280 203 185 181 177 174 170 150 147 130]
> seed 2: clusters [288 220 201 183 179 165 158 149 136 118]
> seed 3: clusters [342 229 204 178 176 168 153 148 108  91]
> seed 4: clusters [398 197 187 178 178 171 165 125 107  91]
>
> shows how poor kmeans is here; is it good anywhere ?

It is poor here because k-means assumes that the data is structured as
a set of well-separated convex clusters (for instance "Gaussian blobs").
There is no reason that 8x8 pixel data of digits would be a union of 10
well-separated convex clusters. It is very likely that the digit samples
lie on curved low(er)-dimensional manifolds (as is usually the case
with pixel data).
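
To make the convexity assumption concrete, here is a minimal sketch on toy
data (not on the digits). The blob centers, sample counts and noise level are
arbitrary choices for illustration, and the adjusted Rand index (ARI) is just
one convenient way to score agreement with the known labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Three compact, well separated Gaussian blobs: the case k-means is built for.
# The centers below are hand-picked for illustration.
X_blobs, y_blobs = make_blobs(
    n_samples=500,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Two interleaved half circles: curved, non-convex clusters.
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)


def kmeans_ari(X, y, n_clusters):
    """Fit k-means and score the labelling against the true labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return adjusted_rand_score(y, km.fit_predict(X))


ari_blobs = kmeans_ari(X_blobs, y_blobs, n_clusters=3)
ari_moons = kmeans_ari(X_moons, y_moons, n_clusters=2)
print("k-means ARI on blobs: %.2f" % ari_blobs)
print("k-means ARI on moons: %.2f" % ari_moons)
```

On the blobs the ARI should be close to 1; on the moons it should be clearly
lower, because k-means cuts the plane with a straight boundary and cannot
follow the curved shapes.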

You should try SpectralClustering on datasets where you know that your
data is "curvy-manifoldish".
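
If you want to try that on the digits themselves, something along these lines
should work. This is only a sketch: the nearest-neighbors affinity and
n_neighbors=10 are my guesses at reasonable settings rather than tuned values,
and there is no guarantee that the result is better than k-means on this data.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, SpectralClustering
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

digits = load_digits()
X, y = digits.data, digits.target

# Baseline: mini-batch k-means with 10 clusters, as in the quoted snippet.
mbkm_labels = MiniBatchKMeans(
    n_clusters=10, n_init=3, random_state=0
).fit_predict(X)

# Spectral clustering on a k-nearest-neighbors graph.
# affinity and n_neighbors are assumptions for illustration, not tuned.
spectral_labels = SpectralClustering(
    n_clusters=10,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
).fit_predict(X)

for name, labels in [("MiniBatchKMeans", mbkm_labels),
                     ("SpectralClustering", spectral_labels)]:
    # Cluster sizes sorted in decreasing order, like the output quoted above.
    sizes = np.sort(np.bincount(labels, minlength=10))[::-1]
    ari = adjusted_rand_score(y, labels)
    print("%s: clusters %s, ARI %.2f" % (name, sizes, ari))
```

Comparing the sorted cluster sizes and the ARI against digits.target gives a
rough idea of which method recovers the 10 classes better.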

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
