Here is a good paper on initializing k-means with kd-trees: http://www.sciencedirect.com/science/article/pii/S0167865507000165

You can use the FLANN package for large data sets. Make sure to read the FLANN documentation <http://www.cs.ubc.ca/research/flann/uploads/FLANN/flann_manual-1.8.4.pdf> to familiarize yourself with its options.
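To make the idea concrete: the expensive part of Lloyd's iterations is finding each point's nearest center, and that is exactly the query a kd-tree (or FLANN) answers quickly. Below is a minimal, untested sketch of just that assignment step. It assumes the KDTree/knn interface from NearestNeighbors.jl (mentioned further down the thread) and uses made-up toy data, so treat it as an illustration of the pattern rather than the seeding method from the paper:

```julia
# Sketch only: the KDTree/knn calls below are assumed from NearestNeighbors.jl's
# README; double-check them against the version you have installed.
using NearestNeighbors

X = rand(5, 100_000)          # toy data: 5 dimensions, columns are observations
k = 8
centers = X[:, 1:k]           # naive seeding; the linked paper is about doing this step better

# Index the current centers. The tree is rebuilt every Lloyd iteration, but with
# k << n it is tiny, and each point's nearest center is found in roughly O(log k).
tree = KDTree(centers)
idxs, _ = knn(tree, X, 1)     # nearest center for every column of X
assignments = [first(i) for i in idxs]
```

With FLANN the pattern is the same, except the index is approximate, which is usually fine for the assignment step on data this large.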
-- Art

On Monday, July 28, 2014 11:20:08 AM UTC-4, John Myles White wrote:
>
> Art wrapped FLANN, which should make things much easier.
>
> — John
>
> On Jul 28, 2014, at 8:19 AM, Dahua Lin <[email protected]> wrote:
>
> We should probably incorporate (approximate) nearest neighbor search into
> Clustering.jl at some point.
>
> Dahua
>
>
> On Monday, July 28, 2014 10:09:59 AM UTC-5, John Myles White wrote:
>>
>> FWIW, there's a KD-tree implementation in NearestNeighbors.jl.
>>
>> — John
>>
>> On Jul 28, 2014, at 7:27 AM, Jacob Quinn <[email protected]> wrote:
>>
>> This probably isn't very helpful currently, but I've been meaning to try
>> a `kd-tree` implementation that allows for fast clustering in up to
>> 7-10 dimensions. (There are also AD-trees for categorical data, which offer
>> even better performance gains over traditional algorithms.)
>>
>> http://www.autonlab.org/autonweb/14669/version/2/part/5/data/moore-veryfast.pdf?branch=main&language=en
>>
>> As a fun fact, Andrew Moore (author of the two algorithms/data structures
>> mentioned above) started the Google Pittsburgh office after leaving CMU, and
>> he has just agreed to come back to CMU as the new dean of computer science!
>>
>> -Jacob
>>
>> On Mon, Jul 28, 2014 at 9:31 AM, René Donner <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Perhaps Quick-Shift clustering might be interesting as well [1]. It is
>>> easy to implement, fast, and, in contrast to k-means / k-medoids (which it
>>> generalizes), has the very appealing property that the initial, hierarchical
>>> data structure has to be computed only once - you can then investigate
>>> different settings of the parameter \tau (the splitting criterion)
>>> extremely fast.
>>>
>>> In many cases it is easier to find a reasonable \tau than to come up
>>> with the exact number of clusters your data is expected to have.
>>>
>>> Cheers,
>>>
>>> Rene
>>>
>>> [1] http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi08quick.pdf
>>>
>>> On 28.07.2014, at 15:06, Randy Zwitch <[email protected]> wrote:
>>>
>>> > I'm about to undertake a clustering exercise for a lot of data
>>> > (roughly 100MM rows * 12 columns for every week, mixed floats/ints, for
>>> > as many weeks as I choose to use). I was going to downsample to 1MM
>>> > events or so and use the Clustering.jl package to get a preliminary idea
>>> > of how many clusters to estimate, since clustering a billion or more
>>> > events will take a bit of computation time. I'm familiar with the 'elbow
>>> > method' of determining k, but that seems a bit arbitrary.
>>> >
>>> > Is anyone familiar with either of the techniques described in these
>>> > two papers? There is a blog post (link) that states that the f(K) method
>>> > is an order of magnitude faster because it removes the need for
>>> > Monte Carlo methods. If anyone has practical experience with these or
>>> > advice about other methods (bonus for providing Julia code!), it would
>>> > be much appreciated.
>>> >
>>> > http://www.stanford.edu/~hastie/Papers/gap.pdf
>>> >
>>> > http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
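Since Julia code was requested for the last two links: below is a rough, untested sketch of the f(K) criterion from the Pham/Dimov/Nguyen paper, layered on Clustering.jl. The alpha recursion and the definition of f(K) follow the paper; the `totalcost` field and the d x n (columns are observations) input convention are assumptions taken from Clustering.jl's documentation, so check them against your installed version.

```julia
using Clustering, Statistics

# f(K) criterion of Pham, Dimov & Nguyen (2005). Values well below 1 suggest
# that K clusters explain the data well; f(K) near 1 means no strong structure.
# X is d x n with columns as observations, kmax the largest K to evaluate.
function fK(X::AbstractMatrix, kmax::Integer)
    d = size(X, 1)
    S = zeros(kmax)     # total within-cluster distortion S_K for each K
    f = ones(kmax)      # f(1) is defined to be 1
    α = 0.0
    for K in 1:kmax
        if K == 1
            S[K] = sum(abs2, X .- mean(X, dims = 2))    # distortion around the global mean
        else
            S[K] = kmeans(X, K).totalcost               # `totalcost` assumed from Clustering.jl docs
            α = K == 2 ? 1 - 3 / (4d) : α + (1 - α) / 6 # the paper's recursion for the weight α_K
            f[K] = S[K-1] == 0 ? 1.0 : S[K] / (α * S[K-1])
        end
    end
    return f
end

# Hypothetical usage on the 1MM-event downsample (transpose if rows are observations):
# fvals = fK(permutedims(sample), 20); argmin(fvals)
```

If I remember the paper correctly, any K with f(K) below roughly 0.85 is a candidate, and the minimum is the usual pick.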
