[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18890.
------------------------------------
    Resolution: Invalid

I closed this because, as [~imranr] pointed out on the PR, these already happen 
in the same thread.  [~witgo], can you change your PR to reference SPARK-19486, 
which describes the behavior you implemented?

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18890
>                 URL: https://issues.apache.org/jira/browse/SPARK-18890
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Kay Ousterhout
>            Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to