Do people have recommendations for starting clusters (seeds) for k-Means? The synthetic control example uses Canopy, and I often see random selection mentioned, but I'm wondering what are considered best practices for obtaining good overall results.
Also, what's the best way to take the Random approach? On a small data set I can easily crank out a program that loops over the vectors and selects some at random, but it seems like in an HDFS environment you'd need an M/R job just to do that initial selection of random documents. Back in my parallel computation days (a _long_ time ago) on big old iron, I seem to recall there being work on parallel/distributed RNG; is that useful here, or is that overkill? Does Hadoop offer tools for this?
Also, is it just me, or does KMeansDriver need to take in "k"? Or is k just assumed to be the number of initial input clusters?
Thanks,
Grant