Do people have recommendations for starting clusters (seeds) for k-Means? The synthetic control example uses Canopy, and I often see random selection mentioned, but I'm wondering what's considered best practice for obtaining good overall results.

Also, what's the best way to take the random approach? On a small data set, I can easily crank out a program that loops over the vectors and selects some at random, but it seems like in an HDFS environment you'd need an M/R job just to do that initial selection of random documents. Back in my parallel computation days (a _long_ time ago) on big old iron, I seem to recall there being work on parallel/distributed RNGs. Is that useful here, or is it overkill? Does Hadoop offer tools for this?
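
For concreteness, here's roughly the kind of thing I mean for the small-data case, just a sketch in plain Java (nothing from the Mahout API): a single-pass reservoir sample that gives you k uniformly random vectors without knowing the total count up front. I'd guess the same idea could run per split inside a mapper, with a single reducer doing a weighted merge of the per-split samples (weighted by how many items each split saw, to keep the overall selection uniform), but I haven't tried that.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Sketch: pick k seed vectors uniformly at random in one pass
     * (classic reservoir sampling). Works on a stream, so the total
     * number of vectors doesn't need to be known in advance.
     */
    public class SeedSampler {
      public static <T> List<T> sample(Iterable<T> vectors, int k, Random rng) {
        List<T> reservoir = new ArrayList<T>(k);
        long seen = 0;
        for (T v : vectors) {
          if (reservoir.size() < k) {
            // The first k vectors fill the reservoir directly.
            reservoir.add(v);
          } else {
            // Keep this vector with probability k / (seen + 1),
            // evicting a uniformly chosen current occupant.
            long j = (long) (rng.nextDouble() * (seen + 1));
            if (j < k) {
              reservoir.set((int) j, v);
            }
          }
          seen++;
        }
        return reservoir;
      }
    }

That said, if Hadoop or Mahout already has something that does this, I'd rather use it, hence the question.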

Also, is it just me, or shouldn't KMeansDriver take "k" as a parameter? Or is k just assumed from the number of initial input clusters?

Thanks,
Grant
