Github user squito commented on the issue:
https://github.com/apache/spark/pull/15505
I agree with Kay that putting in a smaller change first is better, assuming
it still has the performance gains. That doesn't preclude any further
optimizations that are bigger changes.
I'm a little surprised that serializing tasks has much of an impact,
given how little data is getting serialized. But if it really does, I feel like
there is a much bigger optimization we're completely missing. Why are we
repeating the work of serialization for each task in a taskset? The serialized
data is almost exactly the same for *every* task; they only differ in the
partition id (an int) and the preferred locations (which aren't even used by
the executor at all).
Task serialization already leverages the idea of sharing info across all the
tasks in a taskset, via the Broadcast for the task binary. We just need to apply
that same idea to the rest of the task data sent to the executor. Then the only
difference between the serialized task data sent to executors would be the int
for the partitionId. You'd serialize into a ByteBuffer once, and then your
per-task "serialization" becomes copying the buffer and modifying that int
directly.
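Roughly what I have in mind, as a minimal sketch (not Spark's actual TaskDescription wire format; the object name, the `buildTemplate`/`serializeTask` helpers, and the trailing 4-byte partitionId slot are all assumptions for illustration):

```scala
import java.nio.ByteBuffer

object SharedTaskSerialization {
  // Build the template once per taskset: all of the bytes common to every
  // task, plus a 4-byte slot at the end reserved for the partitionId.
  def buildTemplate(commonTaskBytes: Array[Byte]): ByteBuffer = {
    val template = ByteBuffer.allocate(commonTaskBytes.length + 4)
    template.put(commonTaskBytes)
    template.putInt(0)  // placeholder, patched per task
    template.flip()
    template
  }

  // Per-task "serialization": copy the template and patch the int in place.
  def serializeTask(template: ByteBuffer, partitionId: Int): ByteBuffer = {
    val buf = ByteBuffer.allocate(template.remaining())
    buf.put(template.duplicate())                   // bulk copy of the shared bytes
    buf.putInt(buf.capacity() - 4, partitionId)     // absolute write into the reserved slot
    buf.flip()
    buf
  }
}
```

With something like that, the scheduler pays the full serialization cost once per taskset, and the per-task cost is just a buffer copy plus one int write.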