Github user squito commented on the issue:
https://github.com/apache/spark/pull/15505
I agree with Kay that putting in a smaller change first is better, assuming
it still has the performance gains. That doesn't preclude any further
optimizations that are bigger changes.
I'm a little surprised that serializing tasks has much of an impact,
given how little data is getting serialized. But if it really does, I feel like
there is a much bigger optimization we're completely missing. Why are we
repeating the work of serialization for each task in a taskset? The serialized
data is almost exactly the same for *every* task; they only differ in the
partition id (an int) and the preferred locations (which aren't even used by
the executor at all).
Task serialization already leverages the idea of sharing info across all the
tasks in a taskset, via the Broadcast for the task binary. We just need to apply
that same idea to the rest of the task data sent to the executor. Then the only
difference between the serialized task data sent to executors would be the int
for the partitionId. You'd serialize into a ByteBuffer once, and then your
per-task "serialization" becomes copying the buffer and modifying that int
directly.
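Roughly what I have in mind, as a minimal sketch (not Spark's actual TaskDescription wire format; the object name, the `buildTemplate`/`serializeTask` helpers, and the trailing 4-byte partitionId slot are all assumptions for illustration):

```scala
import java.nio.ByteBuffer

object SharedTaskSerialization {
  // Build the template once per taskset: all of the bytes common to every
  // task, plus a 4-byte slot at the end reserved for the partitionId.
  def buildTemplate(commonTaskBytes: Array[Byte]): ByteBuffer = {
    val template = ByteBuffer.allocate(commonTaskBytes.length + 4)
    template.put(commonTaskBytes)
    template.putInt(0)  // placeholder, patched per task
    template.flip()
    template
  }

  // Per-task "serialization": copy the template and patch the int in place.
  def serializeTask(template: ByteBuffer, partitionId: Int): ByteBuffer = {
    val buf = ByteBuffer.allocate(template.remaining())
    buf.put(template.duplicate())                   // bulk copy of the shared bytes
    buf.putInt(buf.capacity() - 4, partitionId)     // absolute write into the reserved slot
    buf.flip()
    buf
  }
}
```

With something like that, the scheduler pays the full serialization cost once per taskset, and the per-task cost is just a buffer copy plus one int write.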