[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15724335#comment-15724335
 ] 

yuhao yang commented on SPARK-18731:
------------------------------------

Based on my experiences, generally KMeans is fast even for large data size. 
100K * 100 is not that large (80M). If the warning concerns you, just try 
different partition size to see if there's any performance improvement 
(probably not). And you can also try to optimize K-Means with GEMM operations 
and use some native BLAS library which can make it much faster depending on 
your data size.



> Task size in K-means is so large
> --------------------------------
>
>                 Key: SPARK-18731
>                 URL: https://issues.apache.org/jira/browse/SPARK-18731
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>            Reporter: Xiaoye Sun
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When run the KMeans algorithm for a large model (e.g. 100k features and 100 
> centers), there will be warning shown for many of the stages saying that the 
> task size is very large. Here is an example warning. 
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to