Re: [julia-users] Semi-OT: Finding optimal k in k-means

Jacob Quinn Mon, 28 Jul 2014 07:28:22 -0700

This probably isn't very helpful right now, but I've been meaning to try a
`kd-tree` implementation that allows fast clustering in up to 7-10
dimensions. (There are also AD-trees for categorical data, which give even
better performance gains over traditional algorithms.)


http://www.autonlab.org/autonweb/14669/version/2/part/5/data/moore-veryfast.pdf?branch=main&language=en

As a fun fact, Andrew Moore (author of the two algorithms/data structures
mentioned above) started the Google Pittsburgh office after leaving CMU and
he's just agreed to come back to CMU as the new dean of computer science!

-Jacob



On Mon, Jul 28, 2014 at 9:31 AM, René Donner <[email protected]> wrote:

> Hi,
>
> perhaps Quick-Shift clustering might be interesting as well [1]. It is
> easy to implement, fast, and in contrast to k-means / k-medoids (which it
> generalizes) has the very appealing property that the initial, hierarchical
> data structure has to be computed only once - you can then investigate
> different settings of the parameter \tau (the splitting criterion)
> extremely fast.
>
> In many cases it is easier to find a reasonable \tau than to come up with
> the exact number of clusters your data is expected to have.
>
> Cheers,
>
> Rene
>
> [1] http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi08quick.pdf
>
>
>
>
>
> On 28.07.2014 at 15:06, Randy Zwitch <[email protected]> wrote:
>
> > I'm about to undertake a clustering exercise for a lot of data (Roughly
> > 100MM rows*12 columns for every week, mixed floats/ints, for as many
> > weeks as I choose to use). I was going to attempt to downsample to 1MM
> > events or so and use the Clustering.jl package to try and pre-gather
> > some idea of how many clusters to estimate, since clustering a billion
> > or more events will take a bit of computation time. I'm familiar with
> > the 'elbow method' of determining k, but that seems a bit arbitrary.
> >
> > Is anyone familiar with either of the techniques described in these two
> > papers? There is a blog post (link) that states that the f(K) method is
> > an order of magnitude faster because it removes the need for Monte Carlo
> > methods. If anyone has any practical experience with these or advice
> > about other methods (bonus for providing Julia code!), it would be much
> > appreciated.
> >
> > http://www.stanford.edu/~hastie/Papers/gap.pdf
> >
> > http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
> >
> >
>
>
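
For anyone who wants to experiment with the Quick-Shift idea René describes
above (build the tree once, then try many values of \tau cheaply), here is a
minimal sketch in Julia 1.x. It is a brute-force O(n^2) illustration, not the
implementation from the paper and not part of Clustering.jl. The function
names quickshift_tree and quickshift_labels, the Gaussian bandwidth sigma,
and the toy data are all assumptions made up for this example.

```julia
# Minimal Quick-Shift sketch (O(n^2) time and memory; illustration only).

# Build the tree once: each point links to its nearest neighbour of higher
# estimated density. X is d×n (one point per column); sigma is the bandwidth
# of the Gaussian kernel (an assumed parameter name).
function quickshift_tree(X::AbstractMatrix{<:Real}, sigma::Real)
    n = size(X, 2)
    # Pairwise squared Euclidean distances.
    D2 = [sum(abs2, X[:, i] .- X[:, j]) for i in 1:n, j in 1:n]
    # Unnormalised Parzen density estimate at every point.
    density = vec(sum(exp.(-D2 ./ (2 * sigma^2)), dims=2))
    parent = collect(1:n)   # a root points to itself
    dist = fill(Inf, n)     # length of the link to the parent; Inf for a root
    for i in 1:n, j in 1:n
        # Ties in density are broken by index so that no cycles can form.
        higher = density[j] > density[i] ||
                 (density[j] == density[i] && j < i)
        d = sqrt(D2[i, j])
        if higher && d < dist[i]
            parent[i] = j
            dist[i] = d
        end
    end
    return parent, dist
end

# Cut every link longer than tau and label each point by the root it reaches.
# This step is cheap, so it can be repeated for many values of tau.
function quickshift_labels(parent::Vector{Int}, dist::Vector{Float64}, tau::Real)
    n = length(parent)
    root = zeros(Int, n)
    for i in 1:n
        r = i
        # Follow only the links that survive the cut.
        while parent[r] != r && dist[r] <= tau
            r = parent[r]
        end
        root[i] = r
    end
    # Relabel the roots as consecutive cluster ids 1, 2, ...
    ids = Dict{Int,Int}()
    return [get!(ids, r, length(ids) + 1) for r in root]
end

# Toy data: two well-separated groups of three points (columns are points).
X = [0.0 0.1 0.2 5.0 5.1 5.2;
     0.0 0.1 0.0 5.0 5.1 5.0]
parent, dist = quickshift_tree(X, 0.5)
for tau in (0.05, 1.0, 10.0)
    k = length(unique(quickshift_labels(parent, dist, tau)))
    println("tau = ", tau, ": ", k, " clusters")
end
```

On this toy data the three settings of tau should give 6, 2 and 1 clusters
respectively, and only quickshift_labels is re-run for each one. For data of
the size Randy mentions, the dense distance matrix would have to be replaced
by a neighbour search on a downsample; that part is not sketched here.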
