Re: Contributing to MLlib: Proposal for Clustering Algorithms

RJ Nowling Fri, 18 Jul 2014 05:07:28 -0700

Nice to meet you, Jeremy!

This is great!  Hierarchical clustering was next on my list --
currently trying to get my PR for MiniBatch KMeans accepted.


If it's cool with you, I'll try converting your code to fit in with
the existing MLlib code as you suggest. I also need to review the
Decision Tree code (as suggested above) to see how much of that can be
reused.

Maybe I can ask you to do a code review for me when I'm done?





On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
<[email protected]> wrote:
> Hi all,
>
> Cool discussion! I agree that a more standardized API for clustering, and
> easy access to underlying routines, would be useful (we've also been
> discussing this when trying to develop streaming clustering algorithms,
> similar to https://github.com/apache/spark/pull/1361).
>
> For divisive hierarchical clustering, I implemented something a while back;
> here's a gist:
>
> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>
> It does bisecting k-means clustering (with k=2), with a recursive class for
> keeping track of the tree. I also found this much better than agglomerative
> methods (for the reasons Hector points out).
>
> This needs to be cleaned up, and can surely be optimized (esp. by replacing
> the core KMeans step with existing MLlib code), but I can say I was running
> it successfully on quite large data sets.
>
> RJ, depending on where you are in your progress, I'd be happy to help work
> on this piece and/or have you use this as a jumping-off point, if useful.
>
> -- Jeremy
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
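
For anyone reading this thread in the archive, the approach described in the quoted message (k-means with k=2 applied recursively, with a recursive class keeping track of the tree) might look roughly like the sketch below. This is not the code from Jeremy's gist and it does not use MLlib; it is a minimal, single-machine, pure-Python illustration. The names ClusterNode, two_means, and bisect, along with the depth and min_size parameters, are hypothetical and exist only for this example.

```python
import random
from dataclasses import dataclass, field
from typing import List, Sequence

Point = Sequence[float]


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: List[Point], rng: random.Random, iters: int = 20):
    """Plain Lloyd iterations with k=2; returns the two groups of points."""
    centers = rng.sample(list(points), 2)  # two random points as initial centers
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split (e.g. duplicate points); caller handles it
        new_centers = [mean(g) for g in groups]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return groups


@dataclass
class ClusterNode:
    """Recursive tree node: a cluster plus the sub-clusters it was split into."""
    points: List[Point]
    center: List[float]
    children: List["ClusterNode"] = field(default_factory=list)

    def leaves(self) -> List["ClusterNode"]:
        """Return the leaf clusters under this node, left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def bisect(points: List[Point], depth: int, rng: random.Random,
           min_size: int = 2) -> ClusterNode:
    """Divisive clustering: split each cluster in two until depth runs out."""
    node = ClusterNode(list(points), mean(points))
    if depth > 0 and len(points) >= max(2, min_size):
        left, right = two_means(node.points, rng)
        if left and right:  # only keep the split if both halves are non-empty
            node.children = [bisect(left, depth - 1, rng, min_size),
                             bisect(right, depth - 1, rng, min_size)]
    return node


# Example: two well-separated groups of points, split once.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
tree = bisect(data, depth=1, rng=random.Random(0))
for leaf in tree.leaves():
    print(leaf.center, leaf.points)
```

In a Spark setting, the two_means step is where the existing MLlib KMeans routine would presumably be swapped in, as suggested in the quoted message; the tree bookkeeping could stay on the driver.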



-- 
em [email protected]
c 954.496.2314
