Thanks Rene and Jacob! These are both certainly helpful. I don't have a really structured project I'm working on at the moment; I'm just supposed to try and "find something interesting" with all this hardware I have access to. So trying different methods beyond k-means is certainly possible, I'll just need to get my academic paper chops up a bit :)

On Monday, July 28, 2014 10:27:53 AM UTC-4, Jacob Quinn wrote:
>
> This probably isn't very helpful currently, but I've been meaning to try
> to do a `kd-tree` implementation that allows for fast clustering for up to
> 7-10 dimensions. (There are also ad-trees for categorical data, which have
> even better performance gains over traditional algorithms.)
>
> http://www.autonlab.org/autonweb/14669/version/2/part/5/data/moore-veryfast.pdf?branch=main&language=en
>
> As a fun fact, Andrew Moore (author of the two algorithms/data structures
> mentioned above) started the Google Pittsburgh office after leaving CMU, and
> he's just agreed to come back to CMU as the new dean of computer science!
>
> -Jacob
>
> On Mon, Jul 28, 2014 at 9:31 AM, René Donner <[email protected]> wrote:
>
>> Hi,
>>
>> perhaps Quick-Shift clustering might be interesting as well [1]. It is
>> easy to implement, fast, and in contrast to k-means / k-medoids (which it
>> generalizes) has the very appealing property that the initial, hierarchical
>> data structure has to be computed only once. You can then investigate
>> different settings of the parameter \tau (the splitting criterion)
>> extremely fast.
>>
>> In many cases it is easier to find a reasonable \tau than to come up with
>> the exact number of clusters your data is expected to have.
>>
>> Cheers,
>>
>> Rene
>>
>> [1] http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi08quick.pdf
>>
>> On 28.07.2014 at 15:06, Randy Zwitch <[email protected]> wrote:
>>
>> > I'm about to undertake a clustering exercise for a lot of data (roughly
>> > 100MM rows * 12 columns for every week, mixed floats/ints, for as many
>> > weeks as I choose to use). I was going to attempt to downsample to 1MM
>> > events or so and use the Clustering.jl package to try and pre-gather some
>> > idea of how many clusters to estimate, since clustering a billion or more
>> > events will take a bit of computation time. I'm familiar with the 'elbow
>> > method' of determining k, but that seems a bit arbitrary.
>> >
>> > Is anyone familiar with either of the techniques described in these two
>> > papers? There is a blog post (link) that states that the f(K) method is
>> > an order of magnitude better in performance time by removing the need for
>> > Monte Carlo methods. If anyone has any practical experience with these or
>> > advice about other methods (bonus for providing Julia code!), it would be
>> > much appreciated.
>> >
>> > http://www.stanford.edu/~hastie/Papers/gap.pdf
>> >
>> > http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
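
For the f(K) method from the Pham et al. paper linked in the quoted question, a minimal sketch might look like the following. It is plain Python rather than Julia, purely for illustration. The function names (`sqdist`, `kmeans_distortion`, `f_values`), the fixed iteration count, and the seeding are assumptions of this sketch and do not come from the paper or from Clustering.jl. The weight factor follows my understanding of the paper and assumes the data has more than one dimension. Smaller f(K) values suggest a better candidate for the number of clusters.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_distortion(points, k, iters=20, seed=0):
    """Run basic Lloyd's k-means; return the summed distortion S_k."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sqdist(p, centers[c]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its members.
        # An empty cluster keeps its previous center.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Total squared distance from every point to its nearest center.
    return sum(min(sqdist(p, c) for c in centers) for p in points)


def f_values(points, k_max):
    """Return {K: f(K)} for K = 1..k_max, assuming dimension n_d > 1."""
    n_d = len(points[0])
    f = {}
    s_prev = None
    alpha_prev = None
    for k in range(1, k_max + 1):
        s_k = kmeans_distortion(points, k)
        if k == 1:
            alpha = None
            f[k] = 1.0
        else:
            if k == 2:
                alpha = 1.0 - 3.0 / (4.0 * n_d)
            else:
                alpha = alpha_prev + (1.0 - alpha_prev) / 6.0
            # f(K) = S_K / (alpha_K * S_{K-1}), or 1 if S_{K-1} is zero.
            f[k] = s_k / (alpha * s_prev) if s_prev > 0 else 1.0
        s_prev, alpha_prev = s_k, alpha
    return f
```

On a downsampled set of rows (each row a tuple of floats), something like `f = f_values(rows, 10)` followed by `min(f, key=f.get)` would give a candidate K. Since no Monte Carlo reference sets are needed, the cost is one k-means run per K, although a single random initialization per K, as used here, can get stuck in a poor local optimum on real data.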
