I'm about to undertake a clustering exercise on a large dataset: roughly 
100MM rows x 12 columns per week, mixed floats/ints, for as many weeks 
as I choose to include. My plan is to downsample to around 1MM events 
and use the Clustering.jl package to get a preliminary estimate of the 
number of clusters, since clustering a billion or more events will take 
considerable computation time. I'm familiar with the 'elbow method' for 
choosing k, but that seems a bit arbitrary.
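
For reference, here is a minimal sketch of the downsample-then-scan 
approach I have in mind, assuming `data` is a d x n Float64 matrix with 
observations as columns (the layout Clustering.jl expects); the sample 
size and range of k are placeholders:

    using Clustering, Random

    # Take a random sample of columns (observations) without replacement.
    function downsample(data::AbstractMatrix, nsample::Int)
        n = size(data, 2)
        data[:, randperm(n)[1:min(nsample, n)]]
    end

    # Total within-cluster cost for each candidate k; the "elbow" is
    # where the marginal drop in cost flattens out.
    elbow_costs(data, ks) = [kmeans(data, k; maxiter=200).totalcost for k in ks]

    # Usage (hypothetical):
    # sample = downsample(data, 1_000_000)
    # costs  = elbow_costs(sample, 2:20)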

Is anyone familiar with either of the techniques described in the two 
papers linked below? There is a blog post 
(http://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/) 
claiming that the f(K) method is an order of magnitude faster than the 
gap statistic because it avoids Monte Carlo reference sampling. If 
anyone has practical experience with either technique, or advice about 
other methods (bonus points for Julia code!), it would be much 
appreciated.

Gap statistic (Tibshirani, Walther & Hastie):
http://www.stanford.edu/~hastie/Papers/gap.pdf

f(K) method (Pham, Dimov & Nguyen):
http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
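
In case it helps, here is a rough sketch of the f(K) statistic as I 
read the Pham et al. paper, again assuming observations as columns; 
S_1 is computed directly as the distortion around the global mean 
rather than by running kmeans with k = 1:

    using Clustering, Statistics

    # Weight factor alpha_K, defined recursively in Pham et al. (2005).
    function alpha(k::Int, Nd::Int)
        k == 2 && return 1 - 3 / (4 * Nd)
        a = alpha(k - 1, Nd)
        a + (1 - a) / 6
    end

    # f(K) for K = 1:kmax, where S_K is the total within-cluster
    # distortion and Nd the number of dimensions. Pham et al. suggest
    # treating K with f(K) below roughly 0.85 as candidates.
    function fK(data::AbstractMatrix, kmax::Int)
        Nd = size(data, 1)
        S = zeros(kmax)
        S[1] = sum(abs2, data .- mean(data, dims=2))  # one-cluster distortion
        for k in 2:kmax
            S[k] = kmeans(data, k; maxiter=200).totalcost
        end
        f = ones(kmax)
        for k in 2:kmax
            f[k] = S[k-1] == 0 ? 1.0 : S[k] / (alpha(k, Nd) * S[k-1])
        end
        f
    end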

