[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15724415#comment-15724415
 ] 

Xiaoye Sun commented on SPARK-18731:
------------------------------------

My concern is not about improving the overall performance of K-means (better 
performance may be an outcome, but it is not my immediate goal). My concern is 
only about the huge "scheduler delay" shown on the Web UI page. I am working on 
a network system for Spark, and our experiments favor small task sizes: with 
small tasks, all tasks in the same stage can start at almost the same time 
across the workers. K-means is one of the use cases of our system, so I am 
wondering whether there can be a K-means implementation with very small task 
sizes, where the data used by the tasks is retrieved after the tasks have been 
deployed on the workers.

I am relatively new to Spark, so what I am asking for may not fit the way 
Spark works.

I think task sizes should always be small, since I saw the warning at the 
driver complaining about large task sizes. I assume Spark prefers small tasks, 
so there should be a way to work around large task sizes.

In other words, my question is: for a task processing a large amount of data, 
can we keep the serialized task itself small, so that the scheduler can deploy 
it quickly and the executor retrieves the data only after the task has been 
deployed on the worker?
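The pattern described here — a small serialized task plus data fetched after deployment — is essentially what Spark's broadcast variables (SparkContext.broadcast) provide: the task carries only a small handle, and each executor pulls the broadcast data once. The sketch below is not Spark code; it is a minimal stdlib illustration of why capturing a large model in the task payload blows up the serialized task size, while shipping only a handle keeps it tiny (the in-memory "store" standing in for Spark's block manager is a hypothetical stand-in):

```python
import pickle

# Hypothetical model data standing in for K-means cluster centers
# (the real case in this issue is ~100 centers x 100k features).
centers = [[float(i + j) for j in range(1000)] for i in range(100)]

# Closure-style task: the centers are captured in the task payload,
# so every serialized task carries the whole model to the workers.
closure_task = {"op": "assign_points", "centers": centers}
closure_size = len(pickle.dumps(closure_task))

# Broadcast-style task: the payload carries only a small handle; the
# worker fetches the centers once after the task has been deployed.
BROADCAST_STORE = {"bc_42": centers}  # stand-in for Spark's block manager
broadcast_task = {"op": "assign_points", "centers_handle": "bc_42"}
handle_size = len(pickle.dumps(broadcast_task))

# The closure-style payload is orders of magnitude larger than the
# handle-style payload, even though both tasks do the same work.
print(closure_size, handle_size)
```

The second task stays small no matter how large the model grows, which is why broadcasting the centers (rather than capturing them in the task closure) is the usual way to avoid the "very large task size" warning.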

Please correct me if I am saying something wrong.

Thanks!

> Task size in K-means is so large
> --------------------------------
>
>                 Key: SPARK-18731
>                 URL: https://issues.apache.org/jira/browse/SPARK-18731
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>            Reporter: Xiaoye Sun
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When running the KMeans algorithm with a large model (e.g. 100k features and 
> 100 centers), a warning is shown for many of the stages saying that the task 
> size is very large. Here is an example warning:
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This can happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 
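A back-of-envelope check on the reported size, assuming the centers are stored as dense double-precision vectors (an assumption; MLlib's actual representation may differ), shows that a model of this shape is indeed tens of megabytes, the same order of magnitude as the 56256 KB in the warning:

```python
# Rough size of a dense K-means model of the reported shape.
features = 100_000
centers = 100
bytes_per_double = 8

model_bytes = features * centers * bytes_per_double
model_kb = model_bytes // 1024
print(model_kb)  # tens of thousands of KB, far above the 100 KB recommendation
```

So any scheme that serializes the full set of centers into each task will necessarily trip the task-size warning; only shipping the model out of band (e.g. via a broadcast) can keep tasks near the recommended size.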



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
