[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146465#comment-15146465 ]
yuhao yang edited comment on SPARK-4039 at 2/27/16 7:56 PM:
------------------------------------------------------------

https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

Updated: I have an implementation there that supports sparse k-means centers. The new implementation tries to balance memory consumption against computation cost. It uses sparse data structures during the computation and dynamically converts the denser centers into dense vectors (the threshold is controlled by a parameter). This reduces memory consumption by using sparse vectors as much as possible, and also lowers the computation cost by reducing the volume of network communication. Performance improvements have been observed on both dense and sparse input data. You are welcome to try it and comment.

was (Author: yuhaoyan):
https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I have an implementation there that supports sparse k-means centers. The calculation pattern can be switched via an extra parameter, so users can choose which pattern to use. As expected, it can save a lot of memory, depending on the average sparsity of the cluster centers, but it also takes much more time. For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory consumption by a factor of 40 yet used 700% of the time. You are welcome to use it if you really need to support large-dimension k-means.

> KMeans support sparse cluster centers
> -------------------------------------
>
>                 Key: SPARK-4039
>                 URL: https://issues.apache.org/jira/browse/SPARK-4039
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Antoine Amend
>              Labels: clustering
>
> When the number of features is not known, it might be quite helpful to create
> sparse vectors using HashingTF.transform.
> KMeans transforms center vectors to dense vectors
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
> therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse?
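
The idea described in the updated comment (keep centers sparse during the computation and convert only the denser ones to dense vectors once they pass a threshold) can be illustrated with a small sketch. This is not the code from the linked branch: it is plain Python rather than Scala/MLlib, and the names `add_point`, `maybe_densify`, `update_center` and the `density_threshold` parameter with its default of 0.5 are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence, Union

SparseVec = Dict[int, float]            # index -> nonzero value
Center = Union[SparseVec, List[float]]  # a center is either sparse or dense


def add_point(acc: SparseVec, point: SparseVec) -> None:
    """Add a sparse point into a sparse running sum, in place."""
    for i, v in point.items():
        acc[i] = acc.get(i, 0.0) + v


def maybe_densify(center: SparseVec, dim: int, density_threshold: float) -> Center:
    """Return a dense list if the fraction of nonzeros exceeds the threshold,
    otherwise return the sparse representation unchanged."""
    if len(center) / dim > density_threshold:
        dense = [0.0] * dim
        for i, v in center.items():
            dense[i] = v
        return dense
    return center


def update_center(points: Sequence[SparseVec], dim: int,
                  density_threshold: float = 0.5) -> Center:
    """Average sparse points into a center, then choose its representation.
    The default threshold is an assumption made for this sketch."""
    acc: SparseVec = {}
    for p in points:
        add_point(acc, p)
    n = len(points)
    mean = {i: v / n for i, v in acc.items()}
    return maybe_densify(mean, dim, density_threshold)
```

With two points that touch only indices 0 and 2, a 10-dimensional center stays sparse under this threshold, while a 3-dimensional one is converted to a dense list. A lower threshold trades memory for faster dense arithmetic, which is the balance the comment describes.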