Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi Yu,

A standardized API has not been implemented yet. I think it would be better to implement the other clustering algorithms first and then extract a common API; others may feel differently. :)

Just a note: there was a pre-existing JIRA I filed for hierarchical KMeans, SPARK-2429 (https://issues.apache.org/jira/browse/SPARK-2429). I added a comment about the previous discussion on the mailing list, example code provided by Jeremy Freeman, and a couple of papers I found.

Feel free to take this over -- I've played with it but haven't had time to finish it. I'd be happy to review the resulting code and discuss approaches with you.

RJ

On Wed, Aug 13, 2014 at 9:20 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hey RJ,

Sorry for the delay. I'd be happy to take a look at this if you can post the code! I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other.

-- Jeremy

On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Thanks, Jeremy. I'm abandoning my initial approach, and I'll work on optimizing your example (so it doesn't do the Breeze vector conversions every time KMeans is called). I need to finish a few other projects first, though, so it may be a couple of weeks.

In the meantime, Yu also created a JIRA for a hierarchical KMeans implementation. I pointed him to your example and a couple of papers I found. If you or Yu beat me to getting an implementation in, I'd be happy to review it. :)

On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

I am also interested in specifying a common framework. I am trying to implement a hierarchical k-means and a hierarchical clustering algorithm similar to the single-link method, using LSH: https://issues.apache.org/jira/browse/SPARK-2966

If you have designed the standardized clustering algorithm API, please let me know.

best,
Yu Ishikawa
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

I wanted to follow up. I have a prototype for an optimized version of hierarchical k-means, and I wanted to get some feedback on my approach.

Jeremy's implementation splits the largest cluster in each round. Is it better to do it that way or to split each cluster in half? Are there any open-source examples that are being widely used in production?

Thanks!

On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote: [...]
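(For concreteness, the two policies differ only in which leaves get bisected each round. The plain-Scala sketch below uses hypothetical names and takes "largest" to mean most points, though within-cluster error is another common criterion. Splitting only the largest leaf lets the final cluster count be arbitrary; splitting every leaf doubles the count each round and yields a balanced tree.)

object SplitPolicies {
  type Point = Array[Double]

  // Bisect only the biggest leaf each round: the leaf count grows by one
  // per round, so the final number of clusters can be any target value.
  def chooseLargest(leaves: Seq[Seq[Point]]): Seq[Seq[Point]] =
    Seq(leaves.maxBy(_.size))

  // Bisect every leaf each round: the leaf count doubles per round, giving
  // a balanced tree but restricting the cluster count to powers of two.
  def chooseAll(leaves: Seq[Seq[Point]]): Seq[Seq[Point]] =
    leaves
}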
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together.

-- Jeremy
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- I'm currently trying to get my PR for MiniBatch KMeans accepted.

If it's cool with you, I'll try converting your code to fit in with the existing MLlib code as you suggest. I also need to review the DecisionTree code (as suggested above) to see how much of it can be reused. Maybe I can ask you to do a code review for me when I'm done?

On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

Cool discussion! I agree that a more standardized API for clustering, and easy access to the underlying routines, would be useful. (We've also been discussing this while trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361.)

For divisive, hierarchical clustering I implemented something a while back; here's a gist: https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for keeping track of the tree. I also found this much better than agglomerative methods (for the reasons Hector points out). It needs to be cleaned up, and can surely be optimized (especially by replacing the core KMeans step with existing MLlib code), but I can say I was running it successfully on quite large data sets.

RJ, depending on where you are in your progress, I'd be happy to help work on this piece and/or have you use this as a jumping-off point, if useful.

-- Jeremy
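(To make the approach concrete: below is a minimal single-machine sketch of bisecting k-means with a recursive tree class, in the spirit of the gist but not taken from it. All names -- ClusterTree, Leaf, Node, BisectingKMeans -- are illustrative, and a real MLlib version would operate on RDDs rather than local vectors.)

import scala.util.Random

sealed trait ClusterTree
case class Leaf(points: Vector[Array[Double]]) extends ClusterTree
case class Node(left: ClusterTree, right: ClusterTree) extends ClusterTree

object BisectingKMeans {
  private val rng = new Random(42)

  private def dist2(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  private def mean(ps: Vector[Array[Double]]): Array[Double] = {
    val s = new Array[Double](ps.head.length)
    for (p <- ps; i <- p.indices) s(i) += p(i)
    s.map(_ / ps.length)
  }

  // One bisection: a few Lloyd iterations with k = 2.
  private def bisect(ps: Vector[Array[Double]], iters: Int = 10)
      : (Vector[Array[Double]], Vector[Array[Double]]) = {
    var centers = rng.shuffle(ps).take(2)
    var split: (Vector[Array[Double]], Vector[Array[Double]]) = (ps, Vector.empty)
    for (_ <- 0 until iters) {
      val (l, r) = ps.partition(p => dist2(p, centers(0)) <= dist2(p, centers(1)))
      if (l.nonEmpty && r.nonEmpty) { centers = Vector(mean(l), mean(r)); split = (l, r) }
    }
    split
  }

  // Recurse until leaves are small enough or can no longer be split.
  def cluster(ps: Vector[Array[Double]], minSize: Int = 5): ClusterTree =
    if (ps.size <= minSize) Leaf(ps)
    else bisect(ps) match {
      case (l, r) if l.nonEmpty && r.nonEmpty =>
        Node(cluster(l, minSize), cluster(r, minSize))
      case _ => Leaf(ps)
    }
}

(Splitting only the largest cluster each round, rather than recursing on both children unconditionally, would replace the recursion with a priority queue over leaves.)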
Re: Contributing to MLlib: Proposal for Clustering Algorithms
I went ahead and created JIRAs.

JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429
JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a few clustering algorithms myself to get a good feel for how to structure such an API. If others with more experience want to dive into designing the API, though, that would let us get moving more quickly.

On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Might be worth checking out scikit-learn and Mahout to get some broad ideas.

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Thanks everyone for the input. It seems what people want is:

* Implement MiniBatch KMeans and hierarchical KMeans (divide-and-conquer approach; look at the DecisionTree implementation as a reference)
* Restructure the three KMeans clustering algorithm implementations to prevent code duplication and conform to a consistent API where possible

If this is correct, I'll start work on that. How would it be best to structure it? Should I submit separate JIRAs/PRs for refactoring the current code, MiniBatch KMeans, and hierarchical KMeans, or keep my current JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR for hierarchical KMeans that builds on top of it?

Thanks!

On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.

I'd say it would make the most sense to submit a PR for the standardized API first, agree on that with everyone, and then build on it for the specific implementations.

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation.

I recently submitted a PR [1] for a MiniBatch KMeans implementation, and I saw an email on this list about interest in implementing Fuzzy C-Means. Based on Sean Owen's review of my MiniBatch KMeans code, it became apparent that before I implement more clustering algorithms, it would be useful to hammer out a framework to reduce code duplication and implement a consistent API.

I'd like to gauge the interest and goals of the MLlib community:

1. Are you interested in having more clustering algorithms available?
2. Is the community interested in specifying a common framework?

Thanks!
RJ

[1] - https://github.com/apache/spark/pull/1248
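(As a sketch of what a "common framework" could mean in code: the traits below are hypothetical, not an existing MLlib interface -- just one way a shared train/predict contract might look for clustering algorithms over RDDs.)

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical shared contract for clustering models.
trait ClusteringModel extends Serializable {
  def predict(point: Vector): Int                    // index of assigned cluster
  def predict(points: RDD[Vector]): RDD[Int] = points.map(p => predict(p))
}

// Hypothetical shared contract for trainers: KMeans, MiniBatch KMeans,
// hierarchical KMeans, etc. would each produce their own model type.
trait ClusteringAlgorithm[M <: ClusteringModel] extends Serializable {
  def run(data: RDD[Vector]): M                      // train a model on an RDD
}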
Re: Contributing to MLlib: Proposal for Clustering Algorithms
I would say for big-data applications the most useful would be hierarchical k-means with backtracking and the ability to support k nearest centroids.

On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: [...]
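(One possible reading of "backtracking and k nearest centroids", sketched below: rather than greedily descending the centroid tree, keep a small beam of candidate branches so that a bad choice near the root can be corrected, and return the k closest leaf centroids. This is an illustrative interpretation with hypothetical names, not Hector's actual implementation.)

object CentroidTree {
  case class CNode(center: Array[Double], children: Seq[CNode]) {
    def isLeaf: Boolean = children.isEmpty
  }

  def d2(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Beam search down the tree: expanding several branches per level lets
  // a wrong turn near the root be corrected ("backtracking"), and the
  // k closest leaf centroids come out at the end.
  def kNearestCentroids(root: CNode, q: Array[Double], k: Int, beam: Int = 4): Seq[CNode] = {
    var frontier = Seq(root)
    var leaves = Seq.empty[CNode]
    while (frontier.nonEmpty) {
      val (done, inner) = frontier.partition(_.isLeaf)
      leaves ++= done
      frontier = inner.flatMap(_.children).sortBy(n => d2(n.center, q)).take(beam)
    }
    leaves.sortBy(n => d2(n.center, q)).take(k)
  }
}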
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Thanks, Hector! Your feedback is useful.

On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hector, could you share the references for hierarchical K-means? Thanks.

On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Having a common framework for clustering makes sense to me. While we should be careful about which algorithms we include, having solid implementations of mini-batch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and as many APIs as reasonable.

On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
No idea, never looked it up. I've always just implemented it as running k-means again on each cluster.

FWIW, standard k-means with Euclidean distance has problems with some dimensionality reduction methods; swapping out the distance metric for negative dot product or cosine may help. Another useful clustering approach would be hierarchical SVD.

The reason I like hierarchical clustering is that it makes for faster inference, especially over billions of users.

On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: [...]
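(The metric swap is a small change in the assignment step; the sketch below is generic plain Scala, not MLlib code, showing negative dot product and cosine distance as drop-in replacements for squared Euclidean distance.)

object Metrics {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def sqEuclidean(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  def negativeDot(a: Array[Double], b: Array[Double]): Double =
    -dot(a, b)

  def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
    val n = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if (n == 0.0) 1.0 else 1.0 - dot(a, b) / n
  }

  // The assignment step is unchanged; only the metric is swapped.
  def assign(p: Array[Double], centers: IndexedSeq[Array[Double]],
             metric: (Array[Double], Array[Double]) => Double): Int =
    centers.indices.minBy(i => metric(p, centers(i)))
}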
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Sure. The more interesting problem here is choosing k at each level; kernel methods seem most promising.

On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom-up approach: the pair of clusters to merge is chosen to minimize variance. Their code is under a BSD license, so it can be used as a template.

Is something like that what you were thinking, Hector?

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: [...]
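(For reference, Ward's criterion has a closed form: merging clusters a and b increases total within-cluster variance by n_a * n_b / (n_a + n_b) * ||mean_a - mean_b||^2. The sketch below, with hypothetical names, picks the merge that minimizes this cost from per-cluster (count, mean) summaries.)

object WardLinkage {
  case class Summary(n: Long, mean: Array[Double]) // per-cluster sufficient stats

  // Increase in total within-cluster variance caused by merging a and b:
  // n_a * n_b / (n_a + n_b) * ||mean_a - mean_b||^2
  def wardCost(a: Summary, b: Summary): Double = {
    val gap = a.mean.indices.map { i =>
      val d = a.mean(i) - b.mean(i); d * d
    }.sum
    a.n.toDouble * b.n / (a.n + b.n) * gap
  }

  // One agglomerative step: merge the pair with the smallest cost.
  def bestPair(cs: IndexedSeq[Summary]): (Int, Int) =
    (for (i <- cs.indices; j <- i + 1 until cs.size) yield (i, j))
      .minBy { case (i, j) => wardCost(cs(i), cs(j)) }
}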
Re: Contributing to MLlib: Proposal for Clustering Algorithms
K doesn't matter much; I've tried anything from 2^10 to 10^3 and the performance doesn't change much as measured by precision @ K (see table 1 of http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3 k-means did slightly outperform 2^10 hierarchical SVD on the metrics, 2^10 SVD was much faster in terms of inference time.

I found that the thing that affected performance most was adding backtracking to fix mistakes made at higher levels, rather than how the k per level is picked.

On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: [...]
Re: Contributing to MLlib: Proposal for Clustering Algorithms
No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm on data already partitioned by the previous level of k-means.

I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already an item-to-item distance metric. This makes it hard to cluster new content. The distances between sibling clusters are also hard to compute (if you have thrown away the similarity matrix) -- do you count paths to the same parent node when computing distances between items in two adjacent nodes, for example? It is also a bit harder to distribute the computation for bottom-up approaches, because you have to find the nearest neighbor of an item just to begin the process.

On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote: [...]
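(A hedged sketch of that top-down scheme using MLlib's existing KMeans.train entry point: fit k-means, partition the RDD by the fitted assignment, and recurse on each part. Driver-side recursion with one filter per cluster is only illustrative; a real implementation would batch the work per level, as discussed below.)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object RecursiveKMeans {
  // Recursively re-cluster each partition of the data; returns the leaves.
  def cluster(data: RDD[Vector], k: Int, depth: Int): Seq[RDD[Vector]] =
    if (depth == 0 || data.count() <= k) Seq(data)
    else {
      val model = KMeans.train(data, k, 20) // existing MLlib entry point
      (0 until k).flatMap { c =>
        val part = data.filter(p => model.predict(p) == c).cache()
        cluster(part, k, depth - 1)
      }
    }
}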
Re: Contributing to MLlib: Proposal for Clustering Algorithms
If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to avoid multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass over the data, but fewer total passes and hopefully lower communication overhead than, say, shuffling around the data that belongs to one cluster or another.

Something like that could work here as well. I'm not super-familiar with hierarchical K-Means, though, so perhaps there's a more efficient way to implement it.

On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote: [...]
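(A sketch of that single-pass, level-wise update, with hypothetical names: each point carries its current leaf id, each leaf has two candidate sub-centers, and one map/reduceByKey pass aggregates the (count, sum) statistics for every (leaf, side) pair at once. The SparkContext._ import supplies the pair-RDD functions in Spark 1.x.)

import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.x)
import org.apache.spark.rdd.RDD

object LevelWiseBisect {
  type Point = Array[Double]

  def d2(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // assigned: (leafId, point); subCenters: two candidate centers per leaf.
  // One pass aggregates (count, vector sum) for every (leaf, side) pair;
  // new sub-centers are then sum / count, computed on the driver.
  def levelPass(assigned: RDD[(Int, Point)],
                subCenters: Map[Int, (Point, Point)]): Map[(Int, Int), (Long, Point)] =
    assigned.map { case (leaf, p) =>
      val (c0, c1) = subCenters(leaf)
      val side = if (d2(p, c0) <= d2(p, c1)) 0 else 1
      ((leaf, side), (1L, p))
    }.reduceByKey { case ((n1, s1), (n2, s2)) =>
      (n1 + n2, s1.indices.map(i => s1(i) + s2(i)).toArray)
    }.collectAsMap().toMap
}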
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Yeah, if one were to replace the objective function in DecisionTree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer.

On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com wrote: [...]