Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Hi Yu,

A standardized API has not been implemented yet.  I think it would be
better to implement the other clustering algorithms then extract a common
API.  Others may feel differently.  :)

Just a note, there was a pre-existing JIRA for hierarchical KMeans
SPARK-2429 https://issues.apache.org/jira/browse/SPARK-2429 I filed.  I
added a comment about previous discussion on the mailing list, example code
provided by a Jeremy Freeman, and a couple of papers I found.

Feel free to take this over -- I've played with it but haven't had time to
finish it.  I'd be happy to review the resulting code and discuss
approaches with you.

RJ



On Wed, Aug 13, 2014 at 9:20 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
wrote:

 Hi all,

 I am also interested in specifying a common framework.
 And I am trying to implement a hierarchical k-means and a hierarchical
 clustering like single-link method with LSH.
 https://issues.apache.org/jira/browse/SPARK-2966

 If you have designed the standardized clustering algorithms API, please let
 me know.


 best,
 Yu Ishikawa



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7822.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman
Hey RJ,

Sorry for the delay, I'd be happy to take a look at this if you can post the 
code!

I think splitting the largest cluster in each round is fairly common, but 
ideally it would be an option to do it one way or the other.

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,
 
 I wanted to follow up.
 
 I have a prototype for an optimized version of hierarchical k-means.  I
 wanted to get some feedback on my apporach.
 
 Jeremy's implementation splits the largest cluster in each round.  Is it
 better to do it that way or to split each cluster in half?
 
 Are there are any open-source examples that are being widely used in
 production?
 
 Thanks!
 
 
 
 On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:
 
 Nice to meet you, Jeremy!
 
 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.
 
 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.
 
 Maybe I can ask you to do a code review for me when I'm done?
 
 
 
 
 
 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:
 Hi all,
 
 Cool discussion! I agree that a more standardized API for clustering, and
 easy access to underlying routines, would be useful (we've also been
 discussing this when trying to develop streaming clustering algorithms,
 similar to https://github.com/apache/spark/pull/1361)
 
 For divisive, hierarchical clustering I implemented something awhile
 back,
 here's a gist.
 
 https://gist.github.com/freeman-lab/5947e7c53b368fe90371
 
 It does bisecting k-means clustering (with k=2), with a recursive class
 for
 keeping track of the tree. I also found this much better than
 agglomerative
 methods (for the reasons Hector points out).
 
 This needs to be cleaned up, and can surely be optimized (esp. by
 replacing
 the core KMeans step with existing MLLib code), but I can say I was
 running
 it successfully on quite large data sets.
 
 RJ, depending on where you are in your progress, I'd be happy to help
 work
 on this piece and / or have you use this as a jumping off point, if
 useful.
 
 -- Jeremy
 
 
 
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
 
 
 --
 em rnowl...@gmail.com
 c 954.496.2314
 
 
 
 
 -- 
 em rnowl...@gmail.com
 c 954.496.2314



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Thanks, Jeremy.  I'm abandoning my initial approach, and I'll work on
optimizing your example (so it doesn't do the breeze-vector conversions
every time KMeans is called).  I need to finish a few other projects first,
though, so it may be a couple weeks.

In the mean time, Yu also created a JIRA for a hierarchical KMeans
implementation.  I pointed him to your example and a couple papers I found.

If you or Yu beat me to getting an implementation in, I'd be happy to
review it.  :)


On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:

 Hey RJ,

 Sorry for the delay, I'd be happy to take a look at this if you can post
 the code!

 I think splitting the largest cluster in each round is fairly common, but
 ideally it would be an option to do it one way or the other.

 -- Jeremy

 -
 jeremy freeman, phd
 neuroscientist
 @thefreemanlab

 On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I wanted to follow up.

 I have a prototype for an optimized version of hierarchical k-means.  I
 wanted to get some feedback on my apporach.

 Jeremy's implementation splits the largest cluster in each round.  Is it
 better to do it that way or to split each cluster in half?

 Are there are any open-source examples that are being widely used in
 production?

 Thanks!



 On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:

 Nice to meet you, Jeremy!

 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.

 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.

 Maybe I can ask you to do a code review for me when I'm done?





 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:

 Hi all,

 Cool discussion! I agree that a more standardized API for clustering, and
 easy access to underlying routines, would be useful (we've also been
 discussing this when trying to develop streaming clustering algorithms,
 similar to https://github.com/apache/spark/pull/1361)

 For divisive, hierarchical clustering I implemented something awhile

 back,

 here's a gist.

 https://gist.github.com/freeman-lab/5947e7c53b368fe90371

 It does bisecting k-means clustering (with k=2), with a recursive class

 for

 keeping track of the tree. I also found this much better than

 agglomerative

 methods (for the reasons Hector points out).

 This needs to be cleaned up, and can surely be optimized (esp. by

 replacing

 the core KMeans step with existing MLLib code), but I can say I was

 running

 it successfully on quite large data sets.

 RJ, depending on where you are in your progress, I'd be happy to help

 work

 on this piece and / or have you use this as a jumping off point, if

 useful.


 -- Jeremy



 --
 View this message in context:


 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html

 Sent from the Apache Spark Developers List mailing list archive at

 Nabble.com.



 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 em rnowl...@gmail.com
 c 954.496.2314





-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-13 Thread Yu Ishikawa
Hi all,

I am also interested in specifying a common framework.
And I am trying to implement a hierarchical k-means and a hierarchical
clustering like single-link method with LSH.
https://issues.apache.org/jira/browse/SPARK-2966

If you have designed the standardized clustering algorithms API, please let
me know.


best,
Yu Ishikawa



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7822.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-12 Thread RJ Nowling
Hi all,

I wanted to follow up.

I have a prototype for an optimized version of hierarchical k-means.  I
wanted to get some feedback on my apporach.

Jeremy's implementation splits the largest cluster in each round.  Is it
better to do it that way or to split each cluster in half?

Are there are any open-source examples that are being widely used in
production?

Thanks!



On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:

 Nice to meet you, Jeremy!

 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.

 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.

 Maybe I can ask you to do a code review for me when I'm done?





 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:
  Hi all,
 
  Cool discussion! I agree that a more standardized API for clustering, and
  easy access to underlying routines, would be useful (we've also been
  discussing this when trying to develop streaming clustering algorithms,
  similar to https://github.com/apache/spark/pull/1361)
 
  For divisive, hierarchical clustering I implemented something awhile
 back,
  here's a gist.
 
  https://gist.github.com/freeman-lab/5947e7c53b368fe90371
 
  It does bisecting k-means clustering (with k=2), with a recursive class
 for
  keeping track of the tree. I also found this much better than
 agglomerative
  methods (for the reasons Hector points out).
 
  This needs to be cleaned up, and can surely be optimized (esp. by
 replacing
  the core KMeans step with existing MLLib code), but I can say I was
 running
  it successfully on quite large data sets.
 
  RJ, depending on where you are in your progress, I'd be happy to help
 work
  on this piece and / or have you use this as a jumping off point, if
 useful.
 
  -- Jeremy
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-19 Thread Jeremy Freeman
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put
together.

-- Jeremy



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7418.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread RJ Nowling
Nice to meet you, Jeremy!

This is great!  Hierarchical clustering was next on my list --
currently trying to get my PR for MiniBatch KMeans accepted.

If it's cool with you, I'll try converting your code to fit in with
the existing MLLib code as you suggest. I also need to review the
Decision Tree code (as suggested above) to see how much of that can be
reused.

Maybe I can ask you to do a code review for me when I'm done?





On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
freeman.jer...@gmail.com wrote:
 Hi all,

 Cool discussion! I agree that a more standardized API for clustering, and
 easy access to underlying routines, would be useful (we've also been
 discussing this when trying to develop streaming clustering algorithms,
 similar to https://github.com/apache/spark/pull/1361)

 For divisive, hierarchical clustering I implemented something awhile back,
 here's a gist.

 https://gist.github.com/freeman-lab/5947e7c53b368fe90371

 It does bisecting k-means clustering (with k=2), with a recursive class for
 keeping track of the tree. I also found this much better than agglomerative
 methods (for the reasons Hector points out).

 This needs to be cleaned up, and can surely be optimized (esp. by replacing
 the core KMeans step with existing MLLib code), but I can say I was running
 it successfully on quite large data sets.

 RJ, depending on where you are in your progress, I'd be happy to help work
 on this piece and / or have you use this as a jumping off point, if useful.

 -- Jeremy



 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.



-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, 

Cool discussion! I agree that a more standardized API for clustering, and
easy access to underlying routines, would be useful (we've also been
discussing this when trying to develop streaming clustering algorithms,
similar to https://github.com/apache/spark/pull/1361) 

For divisive, hierarchical clustering I implemented something awhile back,
here's a gist. 

https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for
keeping track of the tree. I also found this much better than agglomerative
methods (for the reasons Hector points out).

This needs to be cleaned up, and can surely be optimized (esp. by replacing
the core KMeans step with existing MLLib code), but I can say I was running
it successfully on quite large data sets. 

RJ, depending on where you are in your progress, I'd be happy to help work
on this piece and / or have you use this as a jumping off point, if useful. 

-- Jeremy 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
I went ahead and created JIRAs.

JIRA for Hierarchical Clustering:
https://issues.apache.org/jira/browse/SPARK-2429

JIRA for Standarized Clustering APIs:
https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a
few clustering algorithms for myself to get a good feel for how to
structure such an API.  If others with more experience want to dive
into designing the API, though, that would allow us to get moving more
quickly.


On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
 Cool seems like a god initiative. Adding a couple extra high quality 
 clustering implantations will be great.


 I'd say it would make most sense to submit a PR for the Standardised API 
 first, agree that with everyone and then build on it for the specific 
 implementations.
 —
 Sent from Mailbox

 On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks everyone for the input.
 So it seems what people want is:
 * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
 conquer approach, look at DecisionTree implementation as a reference)
 * Restructure 3 Kmeans clustering algorithm implementations to prevent
 code duplication and conform to a consistent API where possible
 If this is correct, I'll start work on that.  How would it be best to
 structure it? Should I submit separate JIRAs / PRs for refactoring of
 current code, MiniBatch KMeans, and Hierarchical or keep my current
 JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
 for Hierarchical KMeans that builds on top of that?
 Thanks!
 On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the references for hierarchical K-means?
   thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
   wrote:

  I would say for bigdata applications 

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and mahout to get some broad ideas—
Sent from Mailbox

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote:

 I went ahead and created JIRAs.
 JIRA for Hierarchical Clustering:
 https://issues.apache.org/jira/browse/SPARK-2429
 JIRA for Standarized Clustering APIs:
 https://issues.apache.org/jira/browse/SPARK-2430
 Before submitting a PR for the standardized API, I want to implement a
 few clustering algorithms for myself to get a good feel for how to
 structure such an API.  If others with more experience want to dive
 into designing the API, though, that would allow us to get moving more
 quickly.
 On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com 
 wrote:
 Cool seems like a god initiative. Adding a couple extra high quality 
 clustering implantations will be great.


 I'd say it would make most sense to submit a PR for the Standardised API 
 first, agree that with everyone and then build on it for the specific 
 implementations.
 —
 Sent from Mailbox

 On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks everyone for the input.
 So it seems what people want is:
 * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
 conquer approach, look at DecisionTree implementation as a reference)
 * Restructure 3 Kmeans clustering algorithm implementations to prevent
 code duplication and conform to a consistent API where possible
 If this is correct, I'll start work on that.  How would it be best to
 structure it? Should I submit separate JIRAs / PRs for refactoring of
 current code, MiniBatch KMeans, and Hierarchical or keep my current
 JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
 for Hierarchical KMeans that builds on top of that?
 Thanks!
 On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level 
  of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you 
  have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the 

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread RJ Nowling
Thanks everyone for the input.

So it seems what people want is:

* Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
conquer approach, look at DecisionTree implementation as a reference)
* Restructure 3 Kmeans clustering algorithm implementations to prevent
code duplication and conform to a consistent API where possible

If this is correct, I'll start work on that.  How would it be best to
structure it? Should I submit separate JIRAs / PRs for refactoring of
current code, MiniBatch KMeans, and Hierarchical or keep my current
JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
for Hierarchical KMeans that builds on top of that?

Thanks!


On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the references for hierarchical K-means?
   thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
   wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 
wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation,
   KMeans.
   It would benefit from having implementations of other
 clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means,
 Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch KMeans
   implementation,
   and I saw an email on this list about interest in implementing
   Fuzzy
   C-Means.
  
   Based on Sean Owen's review of my MiniBatch KMeans code, it
  became
   apparent that before I implement more clustering algorithms,
 it
   would
   be useful to hammer out a framework to reduce code duplication
  and
   

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread Nick Pentreath
Cool seems like a god initiative. Adding a couple extra high quality clustering 
implantations will be great.


I'd say it would make most sense to submit a PR for the Standardised API first, 
agree that with everyone and then build on it for the specific implementations.
—
Sent from Mailbox

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks everyone for the input.
 So it seems what people want is:
 * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
 conquer approach, look at DecisionTree implementation as a reference)
 * Restructure 3 Kmeans clustering algorithm implementations to prevent
 code duplication and conform to a consistent API where possible
 If this is correct, I'll start work on that.  How would it be best to
 structure it? Should I submit separate JIRAs / PRs for refactoring of
 current code, MiniBatch KMeans, and Hierarchical or keep my current
 JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
 for Hierarchical KMeans that builds on top of that?
 Thanks!
 On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the references for hierarchical K-means?
   thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
   wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 
wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation,
   KMeans.
   It would benefit from having implementations of other
 clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means,
 Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch 

Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Hi all,

MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.

I recently submitted a PR [1] for a MiniBatch KMeans implementation,
and I saw an email on this list about interest in implementing Fuzzy
C-Means.

Based on Sean Owen's review of my MiniBatch KMeans code, it became
apparent that before I implement more clustering algorithms, it would
be useful to hammer out a framework to reduce code duplication and
implement a consistent API.

I'd like to gauge the interest and goals of the MLlib community:

1. Are you interested in having more clustering algorithms available?

2. Is the community interested in specifying a common framework?

Thanks!
RJ

[1] - https://github.com/apache/spark/pull/1248


-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
I would say for bigdata applications the most useful would be hierarchical
k-means with back tracking and the ability to support k nearest centroids.


On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 MLlib currently has one clustering algorithm implementation, KMeans.
 It would benefit from having implementations of other clustering
 algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
 Clustering, and Affinity Propagation.

 I recently submitted a PR [1] for a MiniBatch KMeans implementation,
 and I saw an email on this list about interest in implementing Fuzzy
 C-Means.

 Based on Sean Owen's review of my MiniBatch KMeans code, it became
 apparent that before I implement more clustering algorithms, it would
 be useful to hammer out a framework to reduce code duplication and
 implement a consistent API.

 I'd like to gauge the interest and goals of the MLlib community:

 1. Are you interested in having more clustering algorithms available?

 2. Is the community interested in specifying a common framework?

 Thanks!
 RJ

 [1] - https://github.com/apache/spark/pull/1248


 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Thanks, Hector! Your feedback is useful.

On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote:

 I would say for bigdata applications the most useful would be hierarchical
 k-means with back tracking and the ability to support k nearest centroids.


 On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 javascript:; wrote:

  Hi all,
 
  MLlib currently has one clustering algorithm implementation, KMeans.
  It would benefit from having implementations of other clustering
  algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
  Clustering, and Affinity Propagation.
 
  I recently submitted a PR [1] for a MiniBatch KMeans implementation,
  and I saw an email on this list about interest in implementing Fuzzy
  C-Means.
 
  Based on Sean Owen's review of my MiniBatch KMeans code, it became
  apparent that before I implement more clustering algorithms, it would
  be useful to hammer out a framework to reduce code duplication and
  implement a consistent API.
 
  I'd like to gauge the interest and goals of the MLlib community:
 
  1. Are you interested in having more clustering algorithms available?
 
  2. Is the community interested in specifying a common framework?
 
  Thanks!
  RJ
 
  [1] - https://github.com/apache/spark/pull/1248
 
 
  --
  em rnowl...@gmail.com javascript:;
  c 954.496.2314
 



 --
 Yee Yang Li Hector http://google.com/+HectorYee
 *google.com/+HectorYee http://google.com/+HectorYee*



-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Hector, could you share the references for hierarchical K-means? thanks.


On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote:

 I would say for bigdata applications the most useful would be hierarchical
 k-means with back tracking and the ability to support k nearest centroids.


 On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote:

  Hi all,
 
  MLlib currently has one clustering algorithm implementation, KMeans.
  It would benefit from having implementations of other clustering
  algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
  Clustering, and Affinity Propagation.
 
  I recently submitted a PR [1] for a MiniBatch KMeans implementation,
  and I saw an email on this list about interest in implementing Fuzzy
  C-Means.
 
  Based on Sean Owen's review of my MiniBatch KMeans code, it became
  apparent that before I implement more clustering algorithms, it would
  be useful to hammer out a framework to reduce code duplication and
  implement a consistent API.
 
  I'd like to gauge the interest and goals of the MLlib community:
 
  1. Are you interested in having more clustering algorithms available?
 
  2. Is the community interested in specifying a common framework?
 
  Thanks!
  RJ
 
  [1] - https://github.com/apache/spark/pull/1248
 
 
  --
  em rnowl...@gmail.com
  c 954.496.2314
 



 --
 Yee Yang Li Hector http://google.com/+HectorYee
 *google.com/+HectorYee http://google.com/+HectorYee*



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me.  While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much code and APIs as
reasonable.


On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks, Hector! Your feedback is useful.

 On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
  javascript:; wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation, KMeans.
   It would benefit from having implementations of other clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch KMeans implementation,
   and I saw an email on this list about interest in implementing Fuzzy
   C-Means.
  
   Based on Sean Owen's review of my MiniBatch KMeans code, it became
   apparent that before I implement more clustering algorithms, it would
   be useful to hammer out a framework to reduce code duplication and
   implement a consistent API.
  
   I'd like to gauge the interest and goals of the MLlib community:
  
   1. Are you interested in having more clustering algorithms available?
  
   2. Is the community interested in specifying a common framework?
  
   Thanks!
   RJ
  
   [1] - https://github.com/apache/spark/pull/1248
  
  
   --
   em rnowl...@gmail.com javascript:;
   c 954.496.2314
  
 
 
 
  --
  Yee Yang Li Hector http://google.com/+HectorYee
  *google.com/+HectorYee http://google.com/+HectorYee*
 


 --
 em rnowl...@gmail.com
 c 954.496.2314



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No idea, never looked it up. Always just implemented it as doing k-means
again on each cluster.

FWIW standard k-means with euclidean distance has problems too with some
dimensionality reduction methods. Swapping out the distance metric with
negative dot or cosine may help.

Other more useful clustering would be hierarchical SVD. The reason why I
like hierarchical clustering is it makes for faster inference especially
over billions of users.


On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Hector, could you share the references for hierarchical K-means? thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation, KMeans.
   It would benefit from having implementations of other clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch KMeans implementation,
   and I saw an email on this list about interest in implementing Fuzzy
   C-Means.
  
   Based on Sean Owen's review of my MiniBatch KMeans code, it became
   apparent that before I implement more clustering algorithms, it would
   be useful to hammer out a framework to reduce code duplication and
   implement a consistent API.
  
   I'd like to gauge the interest and goals of the MLlib community:
  
   1. Are you interested in having more clustering algorithms available?
  
   2. Is the community interested in specifying a common framework?
  
   Thanks!
   RJ
  
   [1] - https://github.com/apache/spark/pull/1248
  
  
   --
   em rnowl...@gmail.com
   c 954.496.2314
  
 
 
 
  --
  Yee Yang Li Hector http://google.com/+HectorYee
  *google.com/+HectorYee http://google.com/+HectorYee*
 




-- 
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
sure. more interesting problem here is choosing k at each level. Kernel
methods seem to be most promising.


On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote:

 No idea, never looked it up. Always just implemented it as doing k-means
 again on each cluster.

 FWIW standard k-means with euclidean distance has problems too with some
 dimensionality reduction methods. Swapping out the distance metric with
 negative dot or cosine may help.

 Other more useful clustering would be hierarchical SVD. The reason why I
 like hierarchical clustering is it makes for faster inference especially
 over billions of users.


 On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Hector, could you share the references for hierarchical K-means? thanks.
 
 
  On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote:
 
   I would say for bigdata applications the most useful would be
  hierarchical
   k-means with back tracking and the ability to support k nearest
  centroids.
  
  
   On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 wrote:
  
Hi all,
   
MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.
   
I recently submitted a PR [1] for a MiniBatch KMeans implementation,
and I saw an email on this list about interest in implementing Fuzzy
C-Means.
   
Based on Sean Owen's review of my MiniBatch KMeans code, it became
apparent that before I implement more clustering algorithms, it would
be useful to hammer out a framework to reduce code duplication and
implement a consistent API.
   
I'd like to gauge the interest and goals of the MLlib community:
   
1. Are you interested in having more clustering algorithms available?
   
2. Is the community interested in specifying a common framework?
   
Thanks!
RJ
   
[1] - https://github.com/apache/spark/pull/1248
   
   
--
em rnowl...@gmail.com
c 954.496.2314
   
  
  
  
   --
   Yee Yang Li Hector http://google.com/+HectorYee
   *google.com/+HectorYee http://google.com/+HectorYee*
  
 



 --
 Yee Yang Li Hector http://google.com/+HectorYee
 *google.com/+HectorYee http://google.com/+HectorYee*



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
The scikit-learn implementation may be of interest:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom up approach.  The pair of clusters for merging are
chosen to minimize variance.

Their code is under a BSD license so it can be used as a template.

Is something like that you were thinking Hector?

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 sure. more interesting problem here is choosing k at each level. Kernel
 methods seem to be most promising.


 On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote:

 No idea, never looked it up. Always just implemented it as doing k-means
 again on each cluster.

 FWIW standard k-means with euclidean distance has problems too with some
 dimensionality reduction methods. Swapping out the distance metric with
 negative dot or cosine may help.

 Other more useful clustering would be hierarchical SVD. The reason why I
 like hierarchical clustering is it makes for faster inference especially
 over billions of users.


 On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Hector, could you share the references for hierarchical K-means? thanks.
 
 
  On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote:
 
   I would say for bigdata applications the most useful would be
  hierarchical
   k-means with back tracking and the ability to support k nearest
  centroids.
  
  
   On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 wrote:
  
Hi all,
   
MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.
   
I recently submitted a PR [1] for a MiniBatch KMeans implementation,
and I saw an email on this list about interest in implementing Fuzzy
C-Means.
   
Based on Sean Owen's review of my MiniBatch KMeans code, it became
apparent that before I implement more clustering algorithms, it would
be useful to hammer out a framework to reduce code duplication and
implement a consistent API.
   
I'd like to gauge the interest and goals of the MLlib community:
   
1. Are you interested in having more clustering algorithms available?
   
2. Is the community interested in specifying a common framework?
   
Thanks!
RJ
   
[1] - https://github.com/apache/spark/pull/1248
   
   
--
em rnowl...@gmail.com
c 954.496.2314
   
  
  
  
   --
   Yee Yang Li Hector http://google.com/+HectorYee
   *google.com/+HectorYee http://google.com/+HectorYee*
  
 



 --
 Yee Yang Li Hector http://google.com/+HectorYee
 *google.com/+HectorYee http://google.com/+HectorYee*




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
K doesn't matter much I've tried anything from 2^10 to 10^3 and the
performance
doesn't change much as measured by precision @ K. (see table 1
http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3
kmeans did outperform 2^10 hierarchical SVD slightly in terms of the
metrics, 2^10 SVD was much faster in terms of inference time.

I found the thing that affected performance most was adding in back
tracking to fix mistakes made at higher levels rather than how the K is
picked per level.



On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 sure. more interesting problem here is choosing k at each level. Kernel
 methods seem to be most promising.


 On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote:

  No idea, never looked it up. Always just implemented it as doing k-means
  again on each cluster.
 
  FWIW standard k-means with euclidean distance has problems too with some
  dimensionality reduction methods. Swapping out the distance metric with
  negative dot or cosine may help.
 
  Other more useful clustering would be hierarchical SVD. The reason why I
  like hierarchical clustering is it makes for faster inference especially
  over billions of users.
 
 
  On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   Hector, could you share the references for hierarchical K-means?
 thanks.
  
  
   On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
 wrote:
  
I would say for bigdata applications the most useful would be
   hierarchical
k-means with back tracking and the ability to support k nearest
   centroids.
   
   
On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
  wrote:
   
 Hi all,

 MLlib currently has one clustering algorithm implementation,
 KMeans.
 It would benefit from having implementations of other clustering
 algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
 Clustering, and Affinity Propagation.

 I recently submitted a PR [1] for a MiniBatch KMeans
 implementation,
 and I saw an email on this list about interest in implementing
 Fuzzy
 C-Means.

 Based on Sean Owen's review of my MiniBatch KMeans code, it became
 apparent that before I implement more clustering algorithms, it
 would
 be useful to hammer out a framework to reduce code duplication and
 implement a consistent API.

 I'd like to gauge the interest and goals of the MLlib community:

 1. Are you interested in having more clustering algorithms
 available?

 2. Is the community interested in specifying a common framework?

 Thanks!
 RJ

 [1] - https://github.com/apache/spark/pull/1248


 --
 em rnowl...@gmail.com
 c 954.496.2314

   
   
   
--
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*
   
  
 
 
 
  --
  Yee Yang Li Hector http://google.com/+HectorYee
  *google.com/+HectorYee http://google.com/+HectorYee*
 




-- 
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No was thinking more top-down:

assuming a distributed kmeans system already existing, recursively apply
the kmeans algorithm on data already partitioned by the previous level of
kmeans.

I haven't been much of a fan of bottom up approaches like HAC mainly
because they assume there is already a distance metric for items to items.
This makes it hard to cluster new content. The distances between sibling
clusters is also hard to compute (if you have thrown away the similarity
matrix), do you count paths to same parent node if you are computing
distances between items in two adjacent nodes for example. It is also a bit
harder to distribute the computation for bottom up approaches as you have
to already find the nearest neighbor to an item to begin the process.


On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:

 The scikit-learn implementation may be of interest:


 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

 It's a bottom up approach.  The pair of clusters for merging are
 chosen to minimize variance.

 Their code is under a BSD license so it can be used as a template.

 Is something like that you were thinking Hector?

 On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
  sure. more interesting problem here is choosing k at each level. Kernel
  methods seem to be most promising.
 
 
  On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote:
 
  No idea, never looked it up. Always just implemented it as doing k-means
  again on each cluster.
 
  FWIW standard k-means with euclidean distance has problems too with some
  dimensionality reduction methods. Swapping out the distance metric with
  negative dot or cosine may help.
 
  Other more useful clustering would be hierarchical SVD. The reason why I
  like hierarchical clustering is it makes for faster inference especially
  over billions of users.
 
 
  On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   Hector, could you share the references for hierarchical K-means?
 thanks.
  
  
   On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
 wrote:
  
I would say for bigdata applications the most useful would be
   hierarchical
k-means with back tracking and the ability to support k nearest
   centroids.
   
   
On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
  wrote:
   
 Hi all,

 MLlib currently has one clustering algorithm implementation,
 KMeans.
 It would benefit from having implementations of other clustering
 algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
 Clustering, and Affinity Propagation.

 I recently submitted a PR [1] for a MiniBatch KMeans
 implementation,
 and I saw an email on this list about interest in implementing
 Fuzzy
 C-Means.

 Based on Sean Owen's review of my MiniBatch KMeans code, it became
 apparent that before I implement more clustering algorithms, it
 would
 be useful to hammer out a framework to reduce code duplication and
 implement a consistent API.

 I'd like to gauge the interest and goals of the MLlib community:

 1. Are you interested in having more clustering algorithms
 available?

 2. Is the community interested in specifying a common framework?

 Thanks!
 RJ

 [1] - https://github.com/apache/spark/pull/1248


 --
 em rnowl...@gmail.com
 c 954.496.2314

   
   
   
--
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*
   
  
 
 
 
  --
  Yee Yang Li Hector http://google.com/+HectorYee
  *google.com/+HectorYee http://google.com/+HectorYee*
 



 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass over the data, but fewer total passes and hopefully lower
communication overheads than, say, shuffling data around that belongs to
one cluster or another. Something like that could work here as well.

I'm not super-familiar with hierarchical K-Means so perhaps there's a more
efficient way to implement it, though.


On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

 No was thinking more top-down:

 assuming a distributed kmeans system already existing, recursively apply
 the kmeans algorithm on data already partitioned by the previous level of
 kmeans.

 I haven't been much of a fan of bottom up approaches like HAC mainly
 because they assume there is already a distance metric for items to items.
 This makes it hard to cluster new content. The distances between sibling
 clusters is also hard to compute (if you have thrown away the similarity
 matrix), do you count paths to same parent node if you are computing
 distances between items in two adjacent nodes for example. It is also a bit
 harder to distribute the computation for bottom up approaches as you have
 to already find the nearest neighbor to an item to begin the process.


 On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:

  The scikit-learn implementation may be of interest:
 
 
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
 
  It's a bottom up approach.  The pair of clusters for merging are
  chosen to minimize variance.
 
  Their code is under a BSD license so it can be used as a template.
 
  Is something like that you were thinking Hector?
 
  On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
   sure. more interesting problem here is choosing k at each level. Kernel
   methods seem to be most promising.
  
  
   On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
 wrote:
  
   No idea, never looked it up. Always just implemented it as doing
 k-means
   again on each cluster.
  
   FWIW standard k-means with euclidean distance has problems too with
 some
   dimensionality reduction methods. Swapping out the distance metric
 with
   negative dot or cosine may help.
  
   Other more useful clustering would be hierarchical SVD. The reason
 why I
   like hierarchical clustering is it makes for faster inference
 especially
   over billions of users.
  
  
   On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
Hector, could you share the references for hierarchical K-means?
  thanks.
   
   
On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
  wrote:
   
 I would say for bigdata applications the most useful would be
hierarchical
 k-means with back tracking and the ability to support k nearest
centroids.


 On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
   wrote:

  Hi all,
 
  MLlib currently has one clustering algorithm implementation,
  KMeans.
  It would benefit from having implementations of other clustering
  algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
  Clustering, and Affinity Propagation.
 
  I recently submitted a PR [1] for a MiniBatch KMeans
  implementation,
  and I saw an email on this list about interest in implementing
  Fuzzy
  C-Means.
 
  Based on Sean Owen's review of my MiniBatch KMeans code, it
 became
  apparent that before I implement more clustering algorithms, it
  would
  be useful to hammer out a framework to reduce code duplication
 and
  implement a consistent API.
 
  I'd like to gauge the interest and goals of the MLlib community:
 
  1. Are you interested in having more clustering algorithms
  available?
 
  2. Is the community interested in specifying a common framework?
 
  Thanks!
  RJ
 
  [1] - https://github.com/apache/spark/pull/1248
 
 
  --
  em rnowl...@gmail.com
  c 954.496.2314
 



 --
 Yee Yang Li Hector http://google.com/+HectorYee
 *google.com/+HectorYee http://google.com/+HectorYee*

   
  
  
  
   --
   Yee Yang Li Hector http://google.com/+HectorYee
   *google.com/+HectorYee http://google.com/+HectorYee*
  
 
 
 
  --
  em rnowl...@gmail.com
  c 954.496.2314
 



 --
 Yee Yang Li Hector http://google.com/+HectorYee
 *google.com/+HectorYee http://google.com/+HectorYee*



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
Yeah if one were to replace the objective function in decision tree with
minimizing the variance of the leaf nodes it would be a hierarchical
clusterer.


On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the references for hierarchical K-means?
   thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
   wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 
wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation,
   KMeans.
   It would benefit from having implementations of other
 clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means,
 Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch KMeans
   implementation,
   and I saw an email on this list about interest in implementing
   Fuzzy
   C-Means.
  
   Based on Sean Owen's review of my MiniBatch KMeans code, it
  became
   apparent that before I implement more clustering algorithms,
 it
   would
   be useful to hammer out a framework to reduce code duplication
  and
   implement a consistent API.
  
   I'd like to gauge the interest and goals of the MLlib
 community:
  
   1. Are you interested in having more clustering algorithms
   available?
  
   2. Is the community interested in specifying a common
 framework?
  
   Thanks!
   RJ
  
   [1] - https://github.com/apache/spark/pull/1248
  
  
   --
   em rnowl...@gmail.com
   c 954.496.2314
  
 
 
 
  --
  Yee Yang Li Hector http://google.com/+HectorYee
  *google.com/+HectorYee http://google.com/+HectorYee*
 

   
   
   
--
Yee Yang Li Hector http://google.com/+HectorYee
*google.com/+HectorYee http://google.com/+HectorYee*
   
  
  
  
   --
   em