Thanks, Jeremy.  I'm abandoning my initial approach, and I'll work on
optimizing your example (so it doesn't do the Breeze vector conversions
every time KMeans is called).  I need to finish a few other projects first,
though, so it may be a couple weeks.
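
Concretely, the plan is just to do the conversion once up front and cache
it, something along these lines (only a rough sketch, untested, and the
helper name is made up):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
    import org.apache.spark.rdd.RDD

    // Convert to Breeze vectors once and cache the result, so the repeated
    // KMeans calls in the bisecting loop reuse the cached RDD instead of
    // re-converting the data on every call.
    def cacheAsBreeze(data: RDD[MLlibVector]): RDD[BDV[Double]] = {
      val breezeData = data.map(v => new BDV(v.toArray))
      breezeData.cache()  // conversion cost is paid once, up front
      breezeData
    }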

In the meantime, Yu also created a JIRA for a hierarchical KMeans
implementation.  I pointed him to your example and a couple of papers I found.

If you or Yu beat me to getting an implementation in, I'd be happy to
review it.  :)


On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman <freeman.jer...@gmail.com>
wrote:

> Hey RJ,
>
> Sorry for the delay, I'd be happy to take a look at this if you can post
> the code!
>
> I think splitting the largest cluster in each round is fairly common, but
> ideally it would be an option to do it one way or the other.
>
> -- Jeremy
>
> ---------------------
> jeremy freeman, phd
> neuroscientist
> @thefreemanlab
>
> On Aug 12, 2014, at 2:20 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>
> Hi all,
>
> I wanted to follow up.
>
> I have a prototype for an optimized version of hierarchical k-means.  I
> wanted to get some feedback on my approach.
>
> Jeremy's implementation splits the largest cluster in each round.  Is it
> better to do it that way or to split each cluster in half?
>
> Are there any open-source examples that are widely used in
> production?
>
> Thanks!
>
>
>
> On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>
> Nice to meet you, Jeremy!
>
> This is great!  Hierarchical clustering was next on my list --
> I'm currently trying to get my PR for MiniBatch KMeans accepted.
>
> If it's cool with you, I'll try converting your code to fit in with
> the existing MLLib code as you suggest. I also need to review the
> Decision Tree code (as suggested above) to see how much of that can be
> reused.
>
> Maybe I can ask you to do a code review for me when I'm done?
>
>
>
>
>
> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
> <freeman.jer...@gmail.com> wrote:
>
> Hi all,
>
> Cool discussion! I agree that a more standardized API for clustering, and
> easy access to underlying routines, would be useful (we've also been
> discussing this when trying to develop streaming clustering algorithms,
> similar to https://github.com/apache/spark/pull/1361)
>
> For divisive, hierarchical clustering I implemented something a while
> back; here's a gist:
>
> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>
> It does bisecting k-means clustering (with k=2), with a recursive class
> for keeping track of the tree. I also found this much better than
> agglomerative methods (for the reasons Hector points out).
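>
> A heavily simplified sketch of that structure (not the actual gist code,
> just the shape of it; the class and field names here are made up):
>
>     // Each node either has two children (an internal split) or is a
>     // leaf; the leaves of the tree are the final clusters.
>     case class ClusterNode(
>         center: Array[Double],
>         children: Option[(ClusterNode, ClusterNode)]) {
>       def isLeaf: Boolean = children.isEmpty
>       // Collect the leaf nodes by walking the tree recursively.
>       def leaves: Seq[ClusterNode] = children match {
>         case None => Seq(this)
>         case Some((left, right)) => left.leaves ++ right.leaves
>       }
>     }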
>
> This needs to be cleaned up, and can surely be optimized (esp. by
> replacing the core KMeans step with existing MLLib code), but I can say
> I was running it successfully on quite large data sets.
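>
> For instance, the bisection step could presumably just call the existing
> trainer, roughly like this (untested sketch):
>
>     import org.apache.spark.mllib.clustering.KMeans
>     import org.apache.spark.mllib.linalg.Vector
>     import org.apache.spark.rdd.RDD
>
>     // Split one cluster's points in two with MLlib's k-means (k = 2,
>     // maxIterations = 20), tagging each point with its sub-cluster id.
>     def bisect(points: RDD[Vector]): RDD[(Int, Vector)] = {
>       val model = KMeans.train(points, 2, 20)
>       points.map(p => (model.predict(p), p))
>     }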
>
> RJ, depending on where you are in your progress, I'd be happy to help
> work on this piece and/or have you use this as a jumping off point, if
> useful.
>
>
> -- Jeremy
>
> --
> em rnowl...@gmail.com
> c 954.496.2314


-- 
em rnowl...@gmail.com
c 954.496.2314
