Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/15505
> I agree with Kay that putting in a smaller change first is better, assuming it still has the performance gains. That doesn't preclude any further optimizations that are bigger changes.
> I'm a little surprised that serializing the tasks has much of an impact, given how little data is getting serialized. But if it really does, I feel like there is a much bigger optimization we're completely missing. Why are we repeating the work of serialization for each task in a taskset? The serialized data is almost exactly the same for every task; they only differ in the partition id (an int) and the preferred locations (which aren't even used by the executor at all).
> Task serialization already leverages the idea of sharing info across all the tasks in the Broadcast for the task binary. We just need to use that same idea for all the rest of the task data that is sent to the executor. Then the only difference between the serialized task data sent to executors is the int for the partitionId. You'd serialize into a bytebuffer once, and then your per-task "serialization" becomes copying the buffer and modifying that int directly.
@squito I like this idea very much. I have run into cases where the deserialization time is too long (more than 10s for some tasks). Is there any PR that tries to solve this?
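
As a rough illustration of the buffer-patching idea squito describes above, here is a minimal Scala sketch. It assumes the shared task data has already been serialized to bytes and that the partition id occupies a fixed four-byte slot at a known offset (here, the last four bytes of the buffer). This is not Spark's actual TaskDescription encoding; `TaskTemplate`, `serializeOnce`, and `forPartition` are hypothetical names.

```scala
import java.nio.ByteBuffer

// Hypothetical helper: serialize the shared task data once, then stamp
// out per-task buffers by copying the template and patching one int.
object TaskTemplate {

  // Serialize once: the shared bytes followed by a 4-byte partition-id slot.
  def serializeOnce(sharedData: Array[Byte]): ByteBuffer = {
    val buf = ByteBuffer.allocate(sharedData.length + 4)
    buf.put(sharedData)
    buf.putInt(0) // placeholder for the partition id
    buf.flip()
    buf
  }

  // Per-task "serialization": a buffer copy plus one absolute putInt.
  def forPartition(template: ByteBuffer, partitionId: Int): ByteBuffer = {
    val copy = ByteBuffer.allocate(template.remaining())
    copy.put(template.duplicate()) // duplicate() leaves the template's position untouched
    copy.putInt(copy.capacity() - 4, partitionId) // patch the id slot in place
    copy.flip()
    copy
  }
}
```

Under these assumptions, building the serialized form of every task in a taskset costs one real serialization plus N cheap buffer copies, and the preferred locations, which the executor never reads, would simply be left out of the template entirely.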