Thanks, Jeremy. I'm abandoning my initial approach, and I'll work on optimizing your example (so it doesn't do the breeze-vector conversions every time KMeans is called). I need to finish a few other projects first, though, so it may be a couple of weeks.
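To make the split-largest vs. split-everything question concrete, here's a rough standalone sketch of bisecting k-means in plain Python/NumPy. This is not code from Jeremy's gist or from MLlib -- all names here are mine, it runs locally with no Spark, and the "split each cluster" mode is approximated with FIFO (oldest-cluster-first) splitting, which produces the same balanced, breadth-first tree as splitting every cluster once per round:

```python
import numpy as np

def kmeans2(points, n_iter=20, seed=0):
    """Plain Lloyd's k-means with k=2; returns a 0/1 label per point.
    Fixed seed keeps the sketch deterministic."""
    rng = np.random.default_rng(seed)
    # Initialize the two centers as two distinct data points.
    centers = points[rng.choice(len(points), size=2, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every point to each center: shape (n, 2).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in (0, 1):
            mask = labels == j
            if mask.any():  # keep the old center if a side goes empty
                centers[j] = points[mask].mean(axis=0)
    return labels

def bisecting_kmeans(points, k, split_largest=True):
    """Divisive clustering: repeatedly bisect clusters until there are k.
    split_largest=True bisects the biggest current cluster each round;
    split_largest=False bisects clusters oldest-first (FIFO), which is
    equivalent to splitting every cluster once per round."""
    points = np.asarray(points, dtype=float)
    assert 1 <= k <= len(points)
    clusters = [points]
    while len(clusters) < k:
        if split_largest:
            i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        else:
            i = 0  # FIFO: split the oldest cluster first
        target = clusters.pop(i)
        if len(target) < 2:
            clusters.append(target)  # can't bisect a singleton; try another
            continue
        labels = kmeans2(target)
        left, right = target[labels == 0], target[labels == 1]
        if len(left) == 0 or len(right) == 0:
            mid = len(target) // 2  # degenerate k-means run; split arbitrarily
            left, right = target[:mid], target[mid:]
        clusters += [left, right]
    return clusters
```

Splitting the largest cluster tends to equalize cluster sizes, while splitting every cluster per round yields a balanced tree regardless of sizes, so I suspect Jeremy is right that it should just be an option.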
In the meantime, Yu also created a JIRA for a hierarchical KMeans implementation. I pointed him to your example and a couple of papers I found. If you or Yu beat me to getting an implementation in, I'd be happy to review it. :)

On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> Hey RJ,
>
> Sorry for the delay, I'd be happy to take a look at this if you can post
> the code!
>
> I think splitting the largest cluster in each round is fairly common, but
> ideally it would be an option to do it one way or the other.
>
> -- Jeremy
>
> ---------------------
> jeremy freeman, phd
> neuroscientist
> @thefreemanlab
>
> On Aug 12, 2014, at 2:20 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>
> > Hi all,
> >
> > I wanted to follow up.
> >
> > I have a prototype for an optimized version of hierarchical k-means. I
> > wanted to get some feedback on my approach.
> >
> > Jeremy's implementation splits the largest cluster in each round. Is it
> > better to do it that way or to split each cluster in half?
> >
> > Are there any open-source examples that are being widely used in
> > production?
> >
> > Thanks!
> >
> > On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rnowl...@gmail.com> wrote:
> >
> > > Nice to meet you, Jeremy!
> > >
> > > This is great! Hierarchical clustering was next on my list --
> > > currently trying to get my PR for MiniBatch KMeans accepted.
> > >
> > > If it's cool with you, I'll try converting your code to fit in with
> > > the existing MLLib code as you suggest. I also need to review the
> > > Decision Tree code (as suggested above) to see how much of that can
> > > be reused.
> > >
> > > Maybe I can ask you to do a code review for me when I'm done?
> > >
> > > On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
> > > <freeman.jer...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Cool discussion!
> > > > I agree that a more standardized API for clustering, and easy
> > > > access to underlying routines, would be useful (we've also been
> > > > discussing this when trying to develop streaming clustering
> > > > algorithms, similar to https://github.com/apache/spark/pull/1361).
> > > >
> > > > For divisive, hierarchical clustering I implemented something a
> > > > while back; here's a gist:
> > > >
> > > > https://gist.github.com/freeman-lab/5947e7c53b368fe90371
> > > >
> > > > It does bisecting k-means clustering (with k=2), with a recursive
> > > > class for keeping track of the tree. I also found this much better
> > > > than agglomerative methods (for the reasons Hector points out).
> > > >
> > > > This needs to be cleaned up, and can surely be optimized (esp. by
> > > > replacing the core KMeans step with existing MLLib code), but I can
> > > > say I was running it successfully on quite large data sets.
> > > >
> > > > RJ, depending on where you are in your progress, I'd be happy to
> > > > help work on this piece and/or have you use this as a jumping off
> > > > point, if useful.
> > > >
> > > > -- Jeremy
> > > >
> > > > --
> > > > View this message in context:
> > > > http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
> > > > Sent from the Apache Spark Developers List mailing list archive at
> > > > Nabble.com.

--
em rnowl...@gmail.com
c 954.496.2314