2012/9/10 denis <[email protected]>:
> Olivier, +1,
> I had k= instead of n_clusters= --
> drew a warning but not the same :(

Thanks for the bug report.

> Fwiw,
>     for seed in range(5):
>         mbkm = MiniBatchKMeans( 10, random_state=seed, verbose=1
>             ).fit(digits.data)
>
> -->
> seed 0: clusters [294 205 194 188 185 184 178 165 117 87]
> seed 1: clusters [280 203 185 181 177 174 170 150 147 130]
> seed 2: clusters [288 220 201 183 179 165 158 149 136 118]
> seed 3: clusters [342 229 204 178 176 168 153 148 108 91]
> seed 4: clusters [398 197 187 178 178 171 165 125 107 91]
>
> shows how poor kmeans is here; is it good anywhere ? why is it poor?

Because k-means assumes that the data is structured as a set of
well-separated convex clusters (for instance "Gaussian blobs"). There is
no reason why 8x8 pixel data of digits would be a union of 10
well-separated convex clusters. It's very likely that the digit samples
lie on curved low(er)-dimensional manifolds (as is usually the case with
pixel data).

You should try SpectralClustering on datasets where you know that your
data is "curvy-manifoldish".

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
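
To make the convexity point concrete, here is a minimal sketch that is not
from the original thread. It runs a toy Lloyd-style k-means in plain Python
(a stand-in for illustration, not scikit-learn's implementation) on two
concentric rings, a simple case of "curvy" data. The ring radii, point
counts, initial centers and the helper names `ring` and `kmeans` are all
assumptions chosen for the example.

```python
import math


def ring(radius, n):
    """Return n evenly spaced points on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def kmeans(points, centers, n_iter=100):
    """Toy Lloyd iterations: assign to nearest center, then recompute means."""
    centers = list(centers)  # work on a copy of the initial centers
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster goes empty
                centers[k] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels


# Two "true" groups: an inner ring and an outer ring (hypothetical data).
inner, outer = ring(1.0, 40), ring(3.0, 40)
points = inner + outer
truth = [0] * len(inner) + [1] * len(outer)

# Start with one center on each ring, the friendliest possible initialisation.
labels = kmeans(points, centers=[inner[0], outer[0]])

# Cross-tabulate as counts[cluster][ring].
counts = [[0, 0], [0, 0]]
for lab, t in zip(labels, truth):
    counts[lab][t] += 1
print("counts[cluster][ring] =", counts)
```

Even with one starting center on each ring, both k-means clusters end up
containing points from both rings: two centers can only cut the plane along
a straight line, so they cannot separate a ring from the one enclosing it.
This is the kind of data where a graph-based method such as
SpectralClustering is worth trying.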
