Re: [julia-users] Semi-OT: Finding optimal k in k-means

John Myles White Mon, 28 Jul 2014 08:20:41 -0700

Art wrapped FLANN, which should make things much easier.

 — John


On Jul 28, 2014, at 8:19 AM, Dahua Lin <[email protected]> wrote:

> We should probably incorporate (approximate) nearest neighbor search into 
> Clustering.jl at some point.
> 
> Dahua
> 
> 
> On Monday, July 28, 2014 10:09:59 AM UTC-5, John Myles White wrote:
> FWIW, there’s a KD-tree implementation in NearestNeighbors.jl
> 
>  — John
> 
> On Jul 28, 2014, at 7:27 AM, Jacob Quinn <[email protected]> wrote:
> 
>> This probably isn't very helpful currently, but I've been meaning to try to 
>> do a `kd-tree` implementation that allows for fast clustering for up to 7-10 
>> dimensions. (there's also ad-trees for categorical data that has even better 
>> performance gains over traditional algorithms).
>> 
>> http://www.autonlab.org/autonweb/14669/version/2/part/5/data/moore-veryfast.pdf?branch=main&language=en
>> 
>> As a fun fact, Andrew Moore (author of the two algorithms/data structures 
>> mentioned above) started the Google Pittsburgh office after leaving CMU and 
>> he's just agreed to come back to CMU as the new dean of computer science!
>> 
>> -Jacob
>> 
>> 
>> 
>> On Mon, Jul 28, 2014 at 9:31 AM, René Donner <[email protected]> wrote:
>> Hi,
>> 
>> perhaps Quick-Shift clustering might be interesting as well [1]. It is easy 
>> to implement, fast, and in contrast to k-means / k-medoids (which it 
>> generalizes) has the very appealing property that the initial, hierachical 
>> data-structure has to be computed only once - you can then investigate 
>> different settings of the parameter \tau (the splitting criterium) extremely 
>> fast.
>> 
>> In many cases it is easier to find a reasonable \tau than to come up with 
>> the exact number of clusters your data is expected to have.
>> 
>> Cheers,
>> 
>> Rene
>> 
>> [1] http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi08quick.pdf
>> 
>> 
>> 
>> 
>> 
>> Am 28.07.2014 um 15:06 schrieb Randy Zwitch <[email protected]>:
>> 
>> > I'm about to undertake a clustering exercise for a lot of data (Roughly 
>> > 100MM rows*12 columns for every week, mixed floats/ints, for as many weeks 
>> > as I choose to use). I was going to attempt to downsample to 1MM events or 
>> > so and use the Clustering.jl package to try and pre-gather some idea of 
>> > how many clusters to estimate, since clustering a billion or more events 
>> > will take a bit of computation time. I'm familiar with the 'elbow method' 
>> > of determining k, but that seems a bit arbitrary.
>> >
>> > Is anyone familiar with either of the techniques described in these two 
>> > papers? There is a blog post (link) that states that the f(K) method is an 
>> > order of magnitude better in performance time by removing the need for 
>> > monte carlo methods. If anyone has any practical experience with these or 
>> > advice about other methods (bonus for providing Julia code!), it would be 
>> > much appreciated.
>> >
>> > http://www.stanford.edu/~hastie/Papers/gap.pdf
>> >
>> > http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
>> >
>> >
>> 
>> 
>

Re: [julia-users] Semi-OT: Finding optimal k in k-means

Reply via email to