I'm about to undertake a clustering exercise on a lot of data (roughly 100MM rows × 12 columns per week, mixed floats/ints, for as many weeks as I choose to use). My plan was to downsample to around 1MM events and use the Clustering.jl package to get a rough idea of how many clusters to estimate, since clustering a billion or more events will take a fair amount of computation time. I'm familiar with the 'elbow method' of determining k, but that seems a bit arbitrary.
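For concreteness, here's roughly what I had in mind for the downsample-then-elbow step. This is just a sketch: `X` is a placeholder for the downsampled sample (Clustering.jl wants a d×n matrix with observations as columns), and the range of k is arbitrary.

```julia
using Clustering

# Sweep k and record the total within-cluster sum of squares for each run.
function elbow_curve(X::AbstractMatrix, ks)
    costs = Float64[]
    for k in ks
        result = kmeans(X, k; maxiter=200)   # Lloyd's k-means from Clustering.jl
        push!(costs, result.totalcost)       # total within-cluster distortion
    end
    return costs
end

X = rand(12, 100_000)    # placeholder standing in for the ~1MM downsampled events
ks = 2:15
costs = elbow_curve(X, ks)
for (k, c) in zip(ks, costs)
    println("k = $k  total within-cluster cost = $c")   # look for where the drop levels off
end
```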
Is anyone familiar with either of the techniques described in these two papers? There is a blog post (<http://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/>) stating that the f(K) method is roughly an order of magnitude faster than the gap statistic because it removes the need for Monte Carlo simulation. If anyone has practical experience with these, or advice about other methods (bonus for providing Julia code!), it would be much appreciated.

The gap statistic: http://www.stanford.edu/~hastie/Papers/gap.pdf
The f(K) method: http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
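From my reading of the Pham, Dimov & Nguyen paper, the f(K) statistic itself looks cheap to compute on top of Clustering.jl once you have the distortions. Below is my rough, untested sketch of that formula (S_K is the total within-cluster distortion, i.e. `totalcost`, and the weight a_K follows the paper's recursion); corrections welcome if I've misread the paper.

```julia
using Clustering, Statistics

# f(K) from Pham, Dimov & Nguyen (2005): f(K) = S_K / (a_K * S_{K-1}),
# where S_K is the distortion of the K-cluster solution and a_K is a
# dimension-dependent weight defined recursively in the paper.
function fK_statistic(X::AbstractMatrix, kmax::Integer)
    d = size(X, 1)                           # dimensionality (N_d in the paper)
    S = zeros(kmax)
    S[1] = sum(abs2, X .- mean(X, dims=2))   # k = 1: distortion about the global centroid
    for k in 2:kmax
        S[k] = kmeans(X, k; maxiter=200).totalcost
    end
    f = ones(kmax)                           # f(1) = 1 by definition
    a = 0.0
    for k in 2:kmax
        a = (k == 2) ? 1 - 3 / (4d) : a + (1 - a) / 6
        f[k] = S[k-1] > 0 ? S[k] / (a * S[k-1]) : 1.0
    end
    return f
end

# Values of f(k) well below 1 (the paper and the blog post use ~0.85 as a rough
# threshold) mark candidate numbers of clusters; f(k) ≈ 1 everywhere suggests
# no strong clustering structure.
```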
