Jcenter / bintray support for spark packages?
Hi Spark devs, is it possible to add jcenter or bintray support for Spark packages? I'm trying to add our artifact, which is on jcenter (https://bintray.com/airbnb/aerosolve), but I noticed that Spark packages only accepts Maven coordinates. -- Yee Yang Li Hector google.com/+HectorYee
Re: Spark/Mesos
Speaking as a user of Spark on Mesos: yes, each app appears as a separate framework on the Mesos master. In fine-grained mode the number of executors grows and shrinks, versus staying fixed in coarse-grained mode. I would not run fine-grained mode on a large cluster, as it can potentially spin up a lot of executors and effectively DDoS the Mesos master. In a shared environment, coarse-grained mode seems better behaved: you can cap the number of executors with --conf spark.cores.max, and the executors stick around instead of growing and shrinking per stage.
Re: Storing large data for MLlib machine learning
I use Thrift, then base64-encode the binary and save it as text-file lines that are snappy- or gzip-compressed. It makes it very easy to copy small chunks locally and play with subsets of the data without depending on HDFS/Hadoop server infrastructure, for example. On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thanks, Evan. What do you think about Protobuf? Twitter has a library to manage protobuf files in HDFS: https://github.com/twitter/elephant-bird From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, March 26, 2015 2:34 PM To: Stephen Boesch Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Storing large data for MLlib machine learning On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input; you couldn't pass them anything like an InputStream). I don't know if it has gotten any better. Parquet plays much more nicely, and there are lots of Spark-related projects using it already. Keep in mind that it's column-oriented, which might impact performance - but basically you're going to want your features in a byte array, and deserialization should be pretty straightforward. On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch java...@gmail.com wrote: There are some convenience methods you might consider including: MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoint 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander alexander.ula...@hp.com: Hi, could you suggest what would be a reasonable file format for storing feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark? My data is dense feature vectors with labels. Some of the requirements are that the format should be easily loaded/serialized, randomly accessible, and have a small (binary) footprint.
I am considering Parquet, HDF5, and protocol buffers (protobuf), but I have little to no experience with them, so any suggestions would be really appreciated. Best regards, Alexander -- Yee Yang Li Hector google.com/+HectorYee
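The scheme described above (serialize each record, base64-encode the bytes, and write one record per line to a compressed text file) can be sketched with the standard library alone. This is a toy illustration, not the original implementation: json stands in for Thrift, gzip stands in for snappy (only gzip is in the standard library), and the function names are made up.

```python
import base64
import gzip
import json

def write_records(path, records):
    """Serialize each record, base64-encode it, and write one record per line."""
    with gzip.open(path, "wt") as f:
        for record in records:
            payload = json.dumps(record).encode("utf-8")  # Thrift bytes in the original scheme
            f.write(base64.b64encode(payload).decode("ascii") + "\n")

def read_records(path):
    """Decode one record per line; the format stays line-oriented end to end."""
    with gzip.open(path, "rt") as f:
        return [json.loads(base64.b64decode(line)) for line in f]
```

Because every record is a single ASCII line, ordinary tools (head, split, a local script) can grab small chunks of the data without any HDFS dependency, which is the property the post is after.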
Re: over 10000 commits!
Congrats! On Thu, Mar 5, 2015 at 1:34 PM, shane knapp skn...@berkeley.edu wrote: WOOT! On Thu, Mar 5, 2015 at 1:26 PM, Reynold Xin r...@databricks.com wrote: We reached a new milestone today. https://github.com/apache/spark 10,001 commits now. Congratulations to Xiangrui for making the 10,001st commit! -- Yee Yang Li Hector google.com/+HectorYee
Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted
I'm getting a lot of lost tasks with this build on a large Mesos cluster. It happens with both hash and sort shuffles.

14/11/20 18:08:38 WARN TaskSetManager: Lost task 9.1 in stage 1.0 (TID 897, i-d4d6553a.inst.aws.airbnb.com): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=9, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
    at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

On Thu, Nov 20, 2014 at 7:42 AM, Nan Zhu zhunanmcg...@gmail.com wrote: BTW, this PR https://github.com/apache/spark/pull/2524 is related to a blocker-level bug, and it is actually close to being merged (it has been reviewed for several rounds). I would appreciate it if anyone can continue the process, @mateiz --
Nan Zhu http://codingcat.me On Thursday, November 20, 2014 at 10:17 AM, Corey Nolet wrote: I was actually about to post this myself - I have a complex join that could benefit from something like a GroupComparator, versus having to do multiple groupBy operations. This is probably the wrong thread for a full discussion, but I didn't see a JIRA ticket for this or anything similar - any reason why this would not make sense given Spark's design? On Thu, Nov 20, 2014 at 9:39 AM, Madhu ma...@madhu.com wrote: Thanks Patrick. I've been testing some 1.2 features; it looks good so far. I have some example code that I think will be helpful for certain MR-style use cases (secondary sort). Can I still add that to the 1.2 documentation, or is that frozen at this point? -- Madhu https://www.linkedin.com/in/msiddalingaiah
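The MR-style secondary sort mentioned above (group by one key while values arrive ordered by another key within each group) reduces to sorting on a composite key and then grouping. A plain-Python sketch of the pattern, deliberately independent of Spark's API:

```python
from itertools import groupby
from operator import itemgetter

def secondary_sort(pairs):
    """Group (key, value) pairs by key, with values sorted within each group.

    In MapReduce this is done with a composite sort key plus a grouping
    comparator; here a single sort on (key, value) achieves the same layout,
    and groupby then walks the already-ordered runs.
    """
    ordered = sorted(pairs)  # sorts by key first, then by value
    return {k: [v for _, v in grp] for k, grp in groupby(ordered, key=itemgetter(0))}
```

In Spark the same layout is typically achieved by partitioning on the primary key and sorting each partition on the composite key, which avoids the extra groupBy passes Corey describes.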
Re: [VOTE] Release Apache Spark 1.1.1 (RC2)
I'm still seeing the fetch failed error and updated https://issues.apache.org/jira/browse/SPARK-3633 On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) . ran simple things on spark-shell . ran jobs in yarn client cluster modes, and standalone cluster mode On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.1. This release fixes a number of bugs in Spark 1.1.0. Some of the notable ones are - [SPARK-3426] Sort-based shuffle compression settings are incompatible - [SPARK-3948] Stream corruption issues in sort-based shuffle - [SPARK-4107] Incorrect handling of Channel.read() led to data truncation The full list is at http://s.apache.org/z9h and in the CHANGES.txt attached. Additionally, this candidate fixes two blockers from the previous RC: - [SPARK-4434] Cluster mode jar URLs are broken - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle spills The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d): http://s.apache.org/p8 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/andrewor14.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1043/ The documentation corresponding to this release can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/ Please vote on releasing this package as Apache Spark 1.1.1! The vote is open until Saturday, November 22, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.1 [ ] -1 Do not release this package because ... 
To learn more about Apache Spark, please see http://spark.apache.org/ Cheers, Andrew -- Marcelo
Re: [VOTE] Release Apache Spark 1.1.1 (RC2)
I think it is a race condition caused by netty deactivating a channel while it is active. Switched to nio and it works fine: --conf spark.shuffle.blockTransferService=nio On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee hector@gmail.com wrote: I'm still seeing the fetch failed error and updated https://issues.apache.org/jira/browse/SPARK-3633 On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) [...] On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.1. [...] -- Yee Yang Li Hector google.com/+HectorYee
Re: [VOTE] Release Apache Spark 1.1.1 (RC2)
This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/ On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hector, is this a comment on 1.1.1 or on the 1.2 preview? Matei On Nov 20, 2014, at 11:39 AM, Hector Yee hector@gmail.com wrote: I think it is a race condition caused by netty deactivating a channel while it is active. Switched to nio and it works fine: --conf spark.shuffle.blockTransferService=nio [...] -- Yee Yang Li Hector google.com/+HectorYee
Re: [VOTE] Release Apache Spark 1.1.1 (RC2)
Whoops, I must have used the 1.2 preview and mixed them up. spark-shell --version shows version 1.2.0. Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to 1.2. On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, I see. But the spark.shuffle.blockTransferService property doesn't exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem? Matei On Nov 20, 2014, at 11:50 AM, Hector Yee hector@gmail.com wrote: This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/ [...] -- Yee Yang Li Hector google.com/+HectorYee
Re: Contributing to MLlib: Proposal for Clustering Algorithms
I would say that for big-data applications the most useful would be hierarchical k-means with backtracking and the ability to support the k nearest centroids. On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation. I recently submitted a PR [1] for a MiniBatch KMeans implementation, and I saw an email on this list about interest in implementing Fuzzy C-Means. Based on Sean Owen's review of my MiniBatch KMeans code, it became apparent that before I implement more clustering algorithms, it would be useful to hammer out a framework to reduce code duplication and implement a consistent API. I'd like to gauge the interest and goals of the MLlib community: 1. Are you interested in having more clustering algorithms available? 2. Is the community interested in specifying a common framework? Thanks! RJ [1] - https://github.com/apache/spark/pull/1248 -- em rnowl...@gmail.com c 954.496.2314 -- Yee Yang Li Hector google.com/+HectorYee
Re: Contributing to MLlib: Proposal for Clustering Algorithms
No idea, I never looked it up - I've always just implemented it by running k-means again on each cluster. FWIW, standard k-means with euclidean distance also has problems with some dimensionality-reduction methods; swapping the distance metric out for negative dot product or cosine may help. Another useful form of clustering would be hierarchical SVD. The reason I like hierarchical clustering is that it makes for faster inference, especially over billions of users. On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hector, could you share the references for hierarchical k-means? thanks. On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote: I would say for big-data applications the most useful would be hierarchical k-means with backtracking and the ability to support the k nearest centroids. On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation. I recently submitted a PR [1] for a MiniBatch KMeans implementation, and I saw an email on this list about interest in implementing Fuzzy C-Means. Based on Sean Owen's review of my MiniBatch KMeans code, it became apparent that before I implement more clustering algorithms, it would be useful to hammer out a framework to reduce code duplication and implement a consistent API. I'd like to gauge the interest and goals of the MLlib community: 1. Are you interested in having more clustering algorithms available? 2. Is the community interested in specifying a common framework? Thanks!
RJ [1] - https://github.com/apache/spark/pull/1248 -- em rnowl...@gmail.com c 954.496.2314
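Swapping the distance metric, as suggested above, is a one-line change in most k-means loops. A small standard-library sketch of the three distances mentioned (euclidean, negative dot product, and cosine); the function names are illustrative, not from any library:

```python
import math

def euclidean(a, b):
    """Standard euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_dot(a, b):
    """Negative dot product: a larger dot product means 'closer'."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; insensitive to vector magnitude."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)
```

After some dimensionality reductions the magnitudes of the embedded vectors carry little meaning, which is when the cosine variant tends to behave better than raw euclidean distance.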
Re: Contributing to MLlib: Proposal for Clustering Algorithms
K doesn't matter much - I've tried anything from 2^10 to 10^3 and the performance doesn't change much as measured by precision @ K (see table 1 of http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3 k-means did slightly outperform 2^10 hierarchical SVD in terms of the metrics, 2^10 SVD was much faster at inference time. I found that the thing that affected performance most was adding backtracking to fix mistakes made at higher levels, rather than how K is picked per level. On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: sure. the more interesting problem here is choosing k at each level. Kernel methods seem to be most promising. On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote: No idea, never looked it up. Always just implemented it as doing k-means again on each cluster. FWIW standard k-means with euclidean distance has problems too with some dimensionality reduction methods. Swapping out the distance metric with negative dot or cosine may help. Other more useful clustering would be hierarchical SVD. The reason why I like hierarchical clustering is it makes for faster inference especially over billions of users. On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hector, could you share the references for hierarchical K-means? thanks. On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote: I would say for bigdata applications the most useful would be hierarchical k-means with back tracking and the ability to support k nearest centroids. On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation.
Re: Contributing to MLlib: Proposal for Clustering Algorithms
No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm to the data already partitioned by the previous level of k-means. I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already an item-to-item distance metric, which makes it hard to cluster new content. The distances between sibling clusters are also hard to compute (if you have thrown away the similarity matrix) - do you count paths to the same parent node when computing distances between items in two adjacent nodes, for example? It is also a bit harder to distribute the computation for bottom-up approaches, since you have to find the nearest neighbor of an item just to begin the process. On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote: The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward It's a bottom-up approach. The pair of clusters to merge is chosen to minimize variance. Their code is under a BSD license, so it can be used as a template. Is something like that what you were thinking, Hector? On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: sure. the more interesting problem here is choosing k at each level. Kernel methods seem to be most promising. On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote: No idea, never looked it up. Always just implemented it as doing k-means again on each cluster. FWIW standard k-means with euclidean distance has problems too with some dimensionality reduction methods. Swapping out the distance metric with negative dot or cosine may help. Other more useful clustering would be hierarchical SVD. The reason why I like hierarchical clustering is it makes for faster inference especially over billions of users.
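The top-down scheme described in this thread (run k-means, then recurse on each resulting partition) can be sketched in a few lines. This is a single-machine toy on 1-D points with a naive Lloyd's iteration, not the distributed version the thread has in mind; k, depth, and all names are illustrative:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Naive Lloyd's algorithm on 1-D points; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def hierarchical_kmeans(points, k=2, depth=3):
    """Recursively apply k-means to each partition from the previous level."""
    if depth == 0 or len(points) <= k:
        return points  # leaf: the points themselves
    _, clusters = kmeans(points, k)
    return [hierarchical_kmeans(c, k, depth - 1) for c in clusters if c]
```

In the distributed setting each recursive call only touches the data already routed to that branch, which is what makes the top-down approach easy to parallelize compared to bottom-up HAC.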
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Yeah - if one were to replace the objective function in a decision tree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer. On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com wrote: If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass over the data, but fewer total passes and hopefully lower communication overheads than, say, shuffling around data that belongs to one cluster or another. Something like that could work here as well. I'm not super-familiar with hierarchical K-Means, so perhaps there's a more efficient way to implement it, though. On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote: No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm to the data already partitioned by the previous level of k-means. I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already an item-to-item distance metric, which makes it hard to cluster new content. The distances between sibling clusters are also hard to compute (if you have thrown away the similarity matrix) - do you count paths to the same parent node when computing distances between items in two adjacent nodes, for example? It is also a bit harder to distribute the computation for bottom-up approaches, since you have to find the nearest neighbor of an item just to begin the process. On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote: The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward It's a bottom-up approach. The pair of clusters to merge is chosen to minimize variance.
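The decision-tree analogy above can be made concrete: a tree whose splits minimize the summed variance of the resulting leaves is performing divisive clustering. A toy 1-D sketch of a single such split, exhaustive over candidate thresholds and standard library only (not MLlib's DecisionTree, whose real implementation batches split computations per level):

```python
def variance(xs):
    """Sum of squared deviations from the mean (unnormalized variance)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_split(xs):
    """Pick the partition point that minimizes total within-leaf variance,
    mirroring the impurity criterion of a regression tree's split search."""
    xs = sorted(xs)
    best = None
    for i in range(1, len(xs)):  # try every boundary between sorted points
        left, right = xs[:i], xs[i:]
        score = variance(left) + variance(right)
        if best is None or score < best[0]:
            best = (score, left, right)
    return best[1], best[2]
```

Applying best_split recursively to each side yields exactly the variance-minimizing hierarchical clusterer the message describes.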