[GitHub] spark issue #16053: [SPARK-17931] Eliminate unncessary task (de) serializati...

kayousterhout Mon, 19 Dec 2016 10:18:43 -0800

Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/16053
  
    Thanks for the review @squito.  I got sidetracked from this at the end of 
last week and forgot to post the results of some benchmarks @shivaram and I ran 
on a cluster of 20 m2.4xlarge EC2 machines (160 cores).  We ran ~30 trials of 
code [1] (a very simple job with 10K tasks per stage) and measured the average 
time per stage:
    
    Before this change: 2490 ms
    With this change: 2345 ms (so ~6% improvement over the baseline)
    With @witgo's approach in #15505: 2046 ms (~18% improvement over baseline)
    
    The reason that #15505 shows a more significant improvement is that it also 
moves the serialization from the TaskSchedulerImpl thread to the 
CoarseGrainedSchedulerBackend thread.  I added that functionality on top of 
this change and got almost the same improvement as #15505 (average of 2103 ms). 
 I think we should decouple these two changes, both so we have a record of 
the improvement from each individual change, and because this change is 
more about simplifying the code base (the improvement is negligible) while the 
other is about performance improvement.  I filed a separate JIRA for that issue 
here: https://issues.apache.org/jira/browse/SPARK-18890
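
    For reference, the relative improvements quoted above can be recomputed 
from the measured averages.  This is only a small sketch of the arithmetic: the 
variable and function names are made up for illustration, and the only inputs 
are the four averages reported in this comment.

```python
# Average time per stage in milliseconds, copied from the results above.
BASELINE_MS = 2490

# Labels are illustrative; the numbers are the reported averages.
results_ms = {
    "this change": 2345,
    "#15505": 2046,
    "this change + serialization moved to the backend thread": 2103,
}


def improvement_pct(baseline_ms, measured_ms):
    """Percentage reduction in average stage time relative to the baseline."""
    return 100.0 * (baseline_ms - measured_ms) / baseline_ms


for name, ms in results_ms.items():
    pct = improvement_pct(BASELINE_MS, ms)
    print(f"{name}: {ms} ms ({pct:.1f}% improvement over baseline)")
```

    The combined variant (2103 ms) works out to roughly a 15.5% improvement, 
which is why it is described as almost the same as #15505.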



