Do people have recommendations for starting clusters (seeds) for k-Means? The synthetic control example uses Canopy, and I often see random selection mentioned, but I'm wondering what's considered best practice for obtaining good overall results.

Also, what's the best way to take the random approach? On a small data set, I can easily crank out a program that loops over the vectors and selects some at random, but it seems like in an HDFS environment you'd need an M/R job just to do that initial selection of random documents. Back in my parallel computation days (a _long_ time ago) on big old iron, I seem to recall there being work on parallel/distributed RNGs. Is that useful here, or is it overkill? Does Hadoop offer tools for this?
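
For concreteness, here's roughly the kind of thing I mean for the small-data case, just a sketch in plain Java (nothing from the Mahout API): a single-pass reservoir sample that gives you k uniformly random vectors without knowing the total count up front. I'd guess the same idea could run per split inside a mapper, with a single reducer doing a weighted merge of the per-split samples (weighted by how many items each split saw, to keep the overall selection uniform), but I haven't tried that.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Sketch: pick k seed vectors uniformly at random in one pass
     * (classic reservoir sampling). Works on a stream, so the total
     * number of vectors doesn't need to be known in advance.
     */
    public class SeedSampler {
      public static <T> List<T> sample(Iterable<T> vectors, int k, Random rng) {
        List<T> reservoir = new ArrayList<T>(k);
        long seen = 0;
        for (T v : vectors) {
          if (reservoir.size() < k) {
            // The first k vectors fill the reservoir directly.
            reservoir.add(v);
          } else {
            // Keep this vector with probability k / (seen + 1),
            // evicting a uniformly chosen current occupant.
            long j = (long) (rng.nextDouble() * (seen + 1));
            if (j < k) {
              reservoir.set((int) j, v);
            }
          }
          seen++;
        }
        return reservoir;
      }
    }

That said, if Hadoop or Mahout already has something that does this, I'd rather use it, hence the question.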

Also, is it just me, or shouldn't KMeansDriver take "k" as a parameter? Or is k just assumed from the number of initial input clusters?

Thanks,
Grant
