Hey RJ,

Sorry for the delay, I'd be happy to take a look at this if you can post the 
code!

I think splitting the largest cluster in each round is fairly common, but 
ideally it would be an option to do it one way or the other.
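For concreteness, the split-the-largest strategy is basically a loop around a plain 2-means step. Here's a rough, hypothetical Python sketch on 1-D points — the helper names and simplifications are mine, not the gist or MLlib code:

```python
import random

def kmeans2(points, iters=20, seed=0):
    """Plain 2-means on a list of floats; returns the two sub-clusters."""
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)  # pick two distinct initial centers
    left, right = points, []
    for _ in range(iters):
        left = [p for p in points if abs(p - c1) <= abs(p - c2)]
        right = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not left or not right:
            break  # degenerate split; stop early
        c1 = sum(left) / len(left)
        c2 = sum(right) / len(right)
    return left, right

def bisecting_kmeans(points, k):
    """Grow from 1 to k clusters by always bisecting the current largest."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)    # largest cluster ends up last...
        largest = clusters.pop()  # ...and is the one we split this round
        left, right = kmeans2(largest)
        clusters.extend([left, right])
    return clusters
```

In the real thing the `kmeans2` step would be replaced by the existing MLlib KMeans with k=2, and the flat list of clusters by a tree node, roughly as in the gist. The alternative RJ mentions (splitting every cluster each round) would just bisect all of `clusters` per iteration instead of only the largest.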

-- Jeremy

---------------------
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling <rnowl...@gmail.com> wrote:

> Hi all,
> 
> I wanted to follow up.
> 
> I have a prototype for an optimized version of hierarchical k-means.  I
> wanted to get some feedback on my approach.
> 
> Jeremy's implementation splits the largest cluster in each round.  Is it
> better to do it that way or to split each cluster in half?
> 
> Are there any open-source examples that are being widely used in
> production?
> 
> Thanks!
> 
> 
> 
> On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rnowl...@gmail.com> wrote:
> 
>> Nice to meet you, Jeremy!
>> 
>> This is great!  Hierarchical clustering was next on my list --
>> currently trying to get my PR for MiniBatch KMeans accepted.
>> 
>> If it's cool with you, I'll try converting your code to fit in with
>> the existing MLlib code as you suggest. I also need to review the
>> Decision Tree code (as suggested above) to see how much of that can be
>> reused.
>> 
>> Maybe I can ask you to do a code review for me when I'm done?
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
>> <freeman.jer...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> Cool discussion! I agree that a more standardized API for clustering, and
>>> easy access to underlying routines, would be useful (we've also been
>>> discussing this when trying to develop streaming clustering algorithms,
>>> similar to https://github.com/apache/spark/pull/1361)
>>> 
>>> For divisive, hierarchical clustering I implemented something a while
>>> back; here's a gist:
>>> 
>>> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>>> 
>>> It does bisecting k-means clustering (with k=2), with a recursive class
>>> for keeping track of the tree. I also found this much better than
>>> agglomerative methods (for the reasons Hector points out).
>>> 
>>> This needs to be cleaned up, and can surely be optimized (esp. by
>>> replacing the core KMeans step with existing MLlib code), but I can say
>>> I was running it successfully on quite large data sets.
>>> 
>>> RJ, depending on where you are in your progress, I'd be happy to help
>>> work on this piece and / or have you use this as a jumping off point, if
>>> useful.
>>> 
>>> -- Jeremy
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>> 
>> 
>> 
>> --
>> em rnowl...@gmail.com
>> c 954.496.2314
>> 
> 
> 
> 
> -- 
> em rnowl...@gmail.com
> c 954.496.2314
