Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi Yu,

A standardized API has not been implemented yet. I think it would be better to implement the other clustering algorithms first and then extract a common API; others may feel differently. :)

Just a note: there was a pre-existing JIRA I filed for hierarchical KMeans, SPARK-2429 (https://issues.apache.org/jira/browse/SPARK-2429). I added a comment about the previous discussion on the mailing list, example code provided by Jeremy Freeman, and a couple of papers I found.

Feel free to take this over -- I've played with it but haven't had time to finish it. I'd be happy to review the resulting code and discuss approaches with you.

RJ

On Wed, Aug 13, 2014 at 9:20 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hey RJ,

Sorry for the delay. I'd be happy to take a look at this if you can post the code! I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other.

-- Jeremy

On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Thanks, Jeremy. I'm abandoning my initial approach, and I'll work on optimizing your example (so it doesn't do the Breeze vector conversions every time KMeans is called). I need to finish a few other projects first, though, so it may be a couple of weeks.

In the meantime, Yu also created a JIRA for a hierarchical KMeans implementation. I pointed him to your example and a couple of papers I found. If you or Yu beat me to getting an implementation in, I'd be happy to review it. :)

On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

I am also interested in specifying a common framework. I am trying to implement a hierarchical k-means and a hierarchical clustering algorithm similar to the single-link method, using LSH: https://issues.apache.org/jira/browse/SPARK-2966

If you have designed the standardized clustering algorithm API, please let me know.

best,
Yu Ishikawa
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

I wanted to follow up. I have a prototype for an optimized version of hierarchical k-means, and I wanted to get some feedback on my approach.

Jeremy's implementation splits the largest cluster in each round. Is it better to do it that way or to split each cluster in half? Are there any open-source examples that are being widely used in production?

Thanks!

On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote: [...]
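(For concreteness, the two policies differ only in which leaves get bisected each round. The plain-Scala sketch below uses hypothetical names and takes "largest" to mean most points, though within-cluster error is another common criterion. Splitting only the largest leaf lets the final cluster count be arbitrary; splitting every leaf doubles the count each round and yields a balanced tree.)

object SplitPolicies {
  type Point = Array[Double]

  // Bisect only the biggest leaf each round: the leaf count grows by one
  // per round, so the final number of clusters can be any target value.
  def chooseLargest(leaves: Seq[Seq[Point]]): Seq[Seq[Point]] =
    Seq(leaves.maxBy(_.size))

  // Bisect every leaf each round: the leaf count doubles per round, giving
  // a balanced tree but restricting the cluster count to powers of two.
  def chooseAll(leaves: Seq[Seq[Point]]): Seq[Seq[Point]] =
    leaves
}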
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together.

-- Jeremy
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- I'm currently trying to get my PR for MiniBatch KMeans accepted.

If it's cool with you, I'll try converting your code to fit in with the existing MLlib code as you suggest. I also need to review the DecisionTree code (as suggested above) to see how much of it can be reused. Maybe I can ask you to do a code review for me when I'm done?

On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

Cool discussion! I agree that a more standardized API for clustering, and easy access to the underlying routines, would be useful. (We've also been discussing this while trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361.)

For divisive, hierarchical clustering I implemented something a while back; here's a gist: https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for keeping track of the tree. I also found this much better than agglomerative methods (for the reasons Hector points out). It needs to be cleaned up, and can surely be optimized (especially by replacing the core KMeans step with existing MLlib code), but I can say I was running it successfully on quite large data sets.

RJ, depending on where you are in your progress, I'd be happy to help work on this piece and/or have you use this as a jumping-off point, if useful.

-- Jeremy
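(To make the approach concrete: below is a minimal single-machine sketch of bisecting k-means with a recursive tree class, in the spirit of the gist but not taken from it. All names -- ClusterTree, Leaf, Node, BisectingKMeans -- are illustrative, and a real MLlib version would operate on RDDs rather than local vectors.)

import scala.util.Random

sealed trait ClusterTree
case class Leaf(points: Vector[Array[Double]]) extends ClusterTree
case class Node(left: ClusterTree, right: ClusterTree) extends ClusterTree

object BisectingKMeans {
  private val rng = new Random(42)

  private def dist2(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  private def mean(ps: Vector[Array[Double]]): Array[Double] = {
    val s = new Array[Double](ps.head.length)
    for (p <- ps; i <- p.indices) s(i) += p(i)
    s.map(_ / ps.length)
  }

  // One bisection: a few Lloyd iterations with k = 2.
  private def bisect(ps: Vector[Array[Double]], iters: Int = 10)
      : (Vector[Array[Double]], Vector[Array[Double]]) = {
    var centers = rng.shuffle(ps).take(2)
    var split: (Vector[Array[Double]], Vector[Array[Double]]) = (ps, Vector.empty)
    for (_ <- 0 until iters) {
      val (l, r) = ps.partition(p => dist2(p, centers(0)) <= dist2(p, centers(1)))
      if (l.nonEmpty && r.nonEmpty) { centers = Vector(mean(l), mean(r)); split = (l, r) }
    }
    split
  }

  // Recurse until leaves are small enough or can no longer be split.
  def cluster(ps: Vector[Array[Double]], minSize: Int = 5): ClusterTree =
    if (ps.size <= minSize) Leaf(ps)
    else bisect(ps) match {
      case (l, r) if l.nonEmpty && r.nonEmpty =>
        Node(cluster(l, minSize), cluster(r, minSize))
      case _ => Leaf(ps)
    }
}

(Splitting only the largest cluster each round, rather than recursing on both children unconditionally, would replace the recursion with a priority queue over leaves.)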
Re: Contributing to MLlib: Proposal for Clustering Algorithms
I went ahead and created JIRAs.

JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429
JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a few clustering algorithms myself to get a good feel for how to structure such an API. If others with more experience want to dive into designing the API, though, that would let us get moving more quickly.

On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Might be worth checking out scikit-learn and Mahout to get some broad ideas.

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Thanks everyone for the input. It seems what people want is:

* Implement MiniBatch KMeans and hierarchical KMeans (divide-and-conquer approach; look at the DecisionTree implementation as a reference)
* Restructure the three KMeans clustering algorithm implementations to prevent code duplication and conform to a consistent API where possible

If this is correct, I'll start work on that. How would it be best to structure it? Should I submit separate JIRAs/PRs for refactoring the current code, MiniBatch KMeans, and hierarchical KMeans, or keep my current JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR for hierarchical KMeans that builds on top of it?

Thanks!

On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.

I'd say it would make the most sense to submit a PR for the standardized API first, agree on that with everyone, and then build on it for the specific implementations.

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation.

I recently submitted a PR [1] for a MiniBatch KMeans implementation, and I saw an email on this list about interest in implementing Fuzzy C-Means. Based on Sean Owen's review of my MiniBatch KMeans code, it became apparent that before I implement more clustering algorithms, it would be useful to hammer out a framework to reduce code duplication and implement a consistent API.

I'd like to gauge the interest and goals of the MLlib community:

1. Are you interested in having more clustering algorithms available?
2. Is the community interested in specifying a common framework?

Thanks!
RJ

[1] - https://github.com/apache/spark/pull/1248
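(As a sketch of what a "common framework" could mean in code: the traits below are hypothetical, not an existing MLlib interface -- just one way a shared train/predict contract might look for clustering algorithms over RDDs.)

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical shared contract for clustering models.
trait ClusteringModel extends Serializable {
  def predict(point: Vector): Int                    // index of assigned cluster
  def predict(points: RDD[Vector]): RDD[Int] = points.map(p => predict(p))
}

// Hypothetical shared contract for trainers: KMeans, MiniBatch KMeans,
// hierarchical KMeans, etc. would each produce their own model type.
trait ClusteringAlgorithm[M <: ClusteringModel] extends Serializable {
  def run(data: RDD[Vector]): M                      // train a model on an RDD
}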
Re: Contributing to MLlib: Proposal for Clustering Algorithms
I would say for big-data applications the most useful would be hierarchical k-means with backtracking and the ability to support k nearest centroids.

On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: [...]
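(One possible reading of "backtracking and k nearest centroids", sketched below: rather than greedily descending the centroid tree, keep a small beam of candidate branches so that a bad choice near the root can be corrected, and return the k closest leaf centroids. This is an illustrative interpretation with hypothetical names, not Hector's actual implementation.)

object CentroidTree {
  case class CNode(center: Array[Double], children: Seq[CNode]) {
    def isLeaf: Boolean = children.isEmpty
  }

  def d2(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Beam search down the tree: expanding several branches per level lets
  // a wrong turn near the root be corrected ("backtracking"), and the
  // k closest leaf centroids come out at the end.
  def kNearestCentroids(root: CNode, q: Array[Double], k: Int, beam: Int = 4): Seq[CNode] = {
    var frontier = Seq(root)
    var leaves = Seq.empty[CNode]
    while (frontier.nonEmpty) {
      val (done, inner) = frontier.partition(_.isLeaf)
      leaves ++= done
      frontier = inner.flatMap(_.children).sortBy(n => d2(n.center, q)).take(beam)
    }
    leaves.sortBy(n => d2(n.center, q)).take(k)
  }
}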
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Thanks, Hector! Your feedback is useful.

On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hector, could you share the references for hierarchical K-means? Thanks.

On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Having a common framework for clustering makes sense to me. While we should be careful about which algorithms we include, having solid implementations of mini-batch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and as many APIs as reasonable.

On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
No idea, never looked it up. I've always just implemented it as running k-means again on each cluster.

FWIW, standard k-means with Euclidean distance has problems with some dimensionality reduction methods; swapping out the distance metric for negative dot product or cosine may help. Another useful clustering approach would be hierarchical SVD.

The reason I like hierarchical clustering is that it makes for faster inference, especially over billions of users.

On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: [...]
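(The metric swap is a small change in the assignment step; the sketch below is generic plain Scala, not MLlib code, showing negative dot product and cosine distance as drop-in replacements for squared Euclidean distance.)

object Metrics {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def sqEuclidean(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  def negativeDot(a: Array[Double], b: Array[Double]): Double =
    -dot(a, b)

  def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
    val n = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if (n == 0.0) 1.0 else 1.0 - dot(a, b) / n
  }

  // The assignment step is unchanged; only the metric is swapped.
  def assign(p: Array[Double], centers: IndexedSeq[Array[Double]],
             metric: (Array[Double], Array[Double]) => Double): Int =
    centers.indices.minBy(i => metric(p, centers(i)))
}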
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Sure. The more interesting problem here is choosing k at each level; kernel methods seem most promising.

On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom-up approach: the pair of clusters to merge is chosen to minimize variance. Their code is under a BSD license, so it can be used as a template.

Is something like that what you were thinking, Hector?

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: [...]
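(For reference, Ward's criterion has a closed form: merging clusters a and b increases total within-cluster variance by n_a * n_b / (n_a + n_b) * ||mean_a - mean_b||^2. The sketch below, with hypothetical names, picks the merge that minimizes this cost from per-cluster (count, mean) summaries.)

object WardLinkage {
  case class Summary(n: Long, mean: Array[Double]) // per-cluster sufficient stats

  // Increase in total within-cluster variance caused by merging a and b:
  // n_a * n_b / (n_a + n_b) * ||mean_a - mean_b||^2
  def wardCost(a: Summary, b: Summary): Double = {
    val gap = a.mean.indices.map { i =>
      val d = a.mean(i) - b.mean(i); d * d
    }.sum
    a.n.toDouble * b.n / (a.n + b.n) * gap
  }

  // One agglomerative step: merge the pair with the smallest cost.
  def bestPair(cs: IndexedSeq[Summary]): (Int, Int) =
    (for (i <- cs.indices; j <- i + 1 until cs.size) yield (i, j))
      .minBy { case (i, j) => wardCost(cs(i), cs(j)) }
}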
Re: Contributing to MLlib: Proposal for Clustering Algorithms
K doesn't matter much; I've tried anything from 2^10 to 10^3 and the performance doesn't change much as measured by precision @ K (see table 1 of http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3 k-means did slightly outperform 2^10 hierarchical SVD on the metrics, 2^10 SVD was much faster in terms of inference time.

I found that the thing that affected performance most was adding backtracking to fix mistakes made at higher levels, rather than how the k per level is picked.

On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm on data already partitioned by the previous level of k-means.

I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already an item-to-item distance metric. This makes it hard to cluster new content. The distances between sibling clusters are also hard to compute (if you have thrown away the similarity matrix) -- do you count paths to the same parent node when computing distances between items in two adjacent nodes, for example? It is also a bit harder to distribute the computation for bottom-up approaches, because you have to find the nearest neighbor of an item just to begin the process.

On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
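(A hedged sketch of that top-down scheme using MLlib's existing KMeans.train entry point: fit k-means, partition the RDD by the fitted assignment, and recurse on each part. Driver-side recursion with one filter per cluster is only illustrative; a real implementation would batch the work per level, as discussed below.)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object RecursiveKMeans {
  // Recursively re-cluster each partition of the data; returns the leaves.
  def cluster(data: RDD[Vector], k: Int, depth: Int): Seq[RDD[Vector]] =
    if (depth == 0 || data.count() <= k) Seq(data)
    else {
      val model = KMeans.train(data, k, 20) // existing MLlib entry point
      (0 until k).flatMap { c =>
        val part = data.filter(p => model.predict(p) == c).cache()
        cluster(part, k, depth - 1)
      }
    }
}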
Re: Contributing to MLlib: Proposal for Clustering Algorithms
If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to avoid multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass over the data, but fewer total passes and hopefully lower communication overhead than, say, shuffling around the data that belongs to one cluster or another.

Something like that could work here as well. I'm not super-familiar with hierarchical K-Means, though, so perhaps there's a more efficient way to implement it.

On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote: [...]
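(A sketch of that single-pass, level-wise update, with hypothetical names: each point carries its current leaf id, each leaf has two candidate sub-centers, and one map/reduceByKey pass aggregates the (count, sum) statistics for every (leaf, side) pair at once. The SparkContext._ import supplies the pair-RDD functions in Spark 1.x.)

import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.x)
import org.apache.spark.rdd.RDD

object LevelWiseBisect {
  type Point = Array[Double]

  def d2(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // assigned: (leafId, point); subCenters: two candidate centers per leaf.
  // One pass aggregates (count, vector sum) for every (leaf, side) pair;
  // new sub-centers are then sum / count, computed on the driver.
  def levelPass(assigned: RDD[(Int, Point)],
                subCenters: Map[Int, (Point, Point)]): Map[(Int, Int), (Long, Point)] =
    assigned.map { case (leaf, p) =>
      val (c0, c1) = subCenters(leaf)
      val side = if (d2(p, c0) <= d2(p, c1)) 0 else 1
      ((leaf, side), (1L, p))
    }.reduceByKey { case ((n1, s1), (n2, s2)) =>
      (n1 + n2, s1.indices.map(i => s1(i) + s2(i)).toArray)
    }.collectAsMap().toMap
}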
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Yeah, if one were to replace the objective function in DecisionTree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer.

On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com wrote: [...]