[ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146465#comment-15146465
 ] 

yuhao yang edited comment on SPARK-4039 at 2/27/16 7:56 PM:
------------------------------------------------------------

https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

updated: 
I have an implementation there that supports sparse k-means centers. The new 
implementation tries to balance memory consumption against computation cost: it 
keeps sparse data structures during the computation and dynamically converts 
the denser centers into dense vectors (the threshold can be controlled by a 
parameter). This reduces memory consumption by keeping vectors sparse wherever 
possible, and also shortens computation time by compressing the network 
communication.
Performance improvements have been observed on both dense and sparse input 
data. You are welcome to try it and comment.
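The densify-by-threshold idea described above can be sketched in plain Scala. This is only an illustration of the technique, not the actual patch: the names (SparseCenterSketch, densifyThreshold) and the threshold value are hypothetical, and the real implementation works on MLlib Vector types.

```scala
// Hypothetical sketch of the sparsity-threshold idea described above.
// Names and the 0.3 threshold are illustrative, not the actual patch API.
object SparseCenterSketch {
  // A center kept as an index -> value map while it stays sparse.
  type SparseVec = Map[Int, Double]

  // Fraction of nonzero entries above which a center is converted to dense.
  val densifyThreshold = 0.3

  // Decide whether a center has become dense enough to store as an array.
  def shouldDensify(v: SparseVec, dim: Int): Boolean =
    v.size.toDouble / dim > densifyThreshold

  // Materialize a sparse center as a dense array of length `dim`.
  def toDense(v: SparseVec, dim: Int): Array[Double] = {
    val arr = new Array[Double](dim)
    v.foreach { case (i, x) => arr(i) = x }
    arr
  }

  // Squared Euclidean distance between two sparse vectors,
  // touching only indices that are nonzero in either vector.
  def sqDist(a: SparseVec, b: SparseVec): Double = {
    val keys = a.keySet ++ b.keySet
    keys.foldLeft(0.0) { (acc, i) =>
      val d = a.getOrElse(i, 0.0) - b.getOrElse(i, 0.0)
      acc + d * d
    }
  }
}
```

Keeping mostly-zero centers in a sparse map bounds memory by the number of nonzeros rather than the dimension, while densifying the few heavy centers avoids the per-lookup overhead once sparsity no longer pays off.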



was (Author: yuhaoyan):
https://github.com/hhbyyh/spark/blob/kmeansSparse/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

I have an implementation there that supports sparse k-means centers. The 
calculation pattern can be switched via an extra parameter, so users can choose 
which pattern to use. As expected, it saves a lot of memory depending on the 
average sparsity of the cluster centers, but it also consumes considerably more 
time.

For a feature dimension of 10M and a nonzero rate of 1e-6, it reduced memory 
consumption by 40x but took 7x the time. You are welcome to use it if you 
really need to support large-dimension k-means. 

> KMeans support sparse cluster centers
> -------------------------------------
>
>                 Key: SPARK-4039
>                 URL: https://issues.apache.org/jira/browse/SPARK-4039
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Antoine Amend
>              Labels: clustering
>
> When the number of features is not known, it can be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans converts center vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  leading to OutOfMemoryError (even with small k).
> Is there any way to keep the vectors sparse?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
