Hi,

perhaps Quick Shift clustering might be interesting as well [1]. It is easy to 
implement, fast, and, in contrast to k-means / k-medoids (which it generalizes), 
has the very appealing property that the initial, hierarchical data structure 
has to be computed only once - you can then investigate different settings of 
the parameter \tau (the splitting criterion) extremely fast. 

In many cases it is easier to find a reasonable \tau than to come up with the 
exact number of clusters your data is expected to have.
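
Since you mention that Julia code is welcome: below is a rough O(n^2) sketch 
of the idea. The function names, the Gaussian kernel, and all parameter values 
are my own choices for illustration - this is not the paper's optimized 
implementation, and nothing here comes from Clustering.jl. The expensive part 
builds the tree once; extracting a clustering for any \tau afterwards is just 
a cheap tree traversal.

    # Rough quick-shift sketch (O(n^2); names and kernel are my own).
    # X is a d-by-n matrix of points, sigma is the bandwidth of the
    # Parzen density estimate.
    function quickshift_tree(X::AbstractMatrix, sigma::Real)
        n = size(X, 2)
        # pairwise squared distances
        D2 = [sum(abs2, X[:, i] .- X[:, j]) for i in 1:n, j in 1:n]
        # Parzen (Gaussian kernel) density estimate at every point
        P = [sum(exp(-D2[i, j] / (2 * sigma^2)) for j in 1:n) for i in 1:n]
        parent = collect(1:n)    # each point starts as its own root
        dist = fill(Inf, n)      # length of the link to the parent
        for i in 1:n, j in 1:n
            # link i to its nearest neighbor of strictly higher density
            if P[j] > P[i] && D2[i, j] < dist[i]^2
                dist[i] = sqrt(D2[i, j])
                parent[i] = j
            end
        end
        return parent, dist
    end

    # Breaking every link longer than tau turns the tree into a forest;
    # the root of each component labels a cluster. This step is nearly
    # free, so sweeping tau over the same tree costs next to nothing.
    function quickshift_labels(parent::Vector{Int}, dist::Vector{Float64},
                               tau::Real)
        labels = similar(parent)
        for i in eachindex(parent)
            r = i
            while parent[r] != r && dist[r] <= tau
                r = parent[r]
            end
            labels[i] = r
        end
        return labels
    end

Typical use - build once, then sweep \tau:

    X = randn(2, 1_000)                     # toy data: 1000 points in 2-D
    parent, dist = quickshift_tree(X, 0.5)  # expensive part, done once
    for tau in (0.5, 1.0, 2.0)              # sweeping tau is now cheap
        labels = quickshift_labels(parent, dist, tau)
        println("tau = ", tau, ": ", length(unique(labels)), " clusters")
    end

For your data sizes you would of course want a neighbor-search structure 
instead of the dense distance matrix, but the two-phase structure stays 
the same.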

Cheers,

Rene

[1] http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi08quick.pdf

On 28.07.2014 at 15:06, Randy Zwitch <[email protected]> wrote:

> I'm about to undertake a clustering exercise on a lot of data (roughly 100MM 
> rows * 12 columns for every week, mixed floats/ints, for as many weeks as I 
> choose to use). I was going to downsample to about 1MM events and use the 
> Clustering.jl package to get a preliminary idea of how many clusters to 
> estimate, since clustering a billion or more events will take a fair bit of 
> computation time. I'm familiar with the 'elbow method' of determining k, but 
> that seems a bit arbitrary.
> 
> Is anyone familiar with either of the techniques described in these two 
> papers? There is a blog post (link) that states that the f(K) method is an 
> order of magnitude faster, since it removes the need for Monte Carlo 
> methods. If anyone has practical experience with these, or advice about 
> other methods (bonus for providing Julia code!), it would be much 
> appreciated.
> 
> http://www.stanford.edu/~hastie/Papers/gap.pdf
> 
> http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
> 
