GitHub user kayousterhout opened a pull request:
https://github.com/apache/spark/pull/16053
[SPARK-17931] Eliminate unnecessary task (de)serialization
## What changes were proposed in this pull request?
In the existing code, there are three layers of serialization
involved in sending a task from the scheduler to an executor:
- A Task object is serialized
- The serialized Task is copied to a byte buffer that also
contains serialized information about any additional JARs,
files, and Properties needed for the task to execute. This
byte buffer is stored as the member variable serializedTask
in the TaskDescription class.
- The TaskDescription is serialized (in addition to the serialized
task + JARs, the TaskDescription class contains the task ID and
other metadata) and sent in a LaunchTask message.
While it *is* necessary to have two layers of serialization, so that
the JAR, file, and Property info can be deserialized prior to
deserializing the Task object, the third layer of serialization is
unnecessary. This commit eliminates a layer of serialization by moving
the JARs, files, and Properties into the TaskDescription class.
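As a rough illustration of the difference (not the code in this patch), the two schemes can be sketched as below. Everything here is a hypothetical stand-in: JSON plays the role of the serializer, nested payloads are hex-encoded so they can be embedded, and the function and field names are made up for the example rather than taken from Spark.

```python
import json

# Hypothetical stand-ins for illustration only; this is not Spark's wire format.

def serialize(obj) -> bytes:
    return json.dumps(obj, sort_keys=True).encode("utf-8")

def deserialize(data: bytes):
    return json.loads(data.decode("utf-8"))

DEP_KEYS = ("jars", "files", "properties")

def launch_old(task_id, task, jars, files, properties) -> bytes:
    """Existing scheme: three serialization passes."""
    task_bytes = serialize(task)               # layer 1: the Task
    buffer = serialize({                       # layer 2: dependencies + Task bytes
        "jars": jars, "files": files, "properties": properties,
        "task": task_bytes.hex(),
    })
    return serialize({                         # layer 3: the TaskDescription
        "task_id": task_id,
        "serialized_task": buffer.hex(),
    })

def run_old(message: bytes):
    desc = deserialize(message)
    buffer = deserialize(bytes.fromhex(desc["serialized_task"]))
    deps = {k: buffer[k] for k in DEP_KEYS}
    # Dependencies are known here, before the Task itself is deserialized.
    task = deserialize(bytes.fromhex(buffer["task"]))
    return desc["task_id"], deps, task

def launch_new(task_id, task, jars, files, properties) -> bytes:
    """Proposed scheme: dependencies live in the TaskDescription; two passes."""
    task_bytes = serialize(task)               # layer 1: the Task
    return serialize({                         # layer 2: the TaskDescription
        "task_id": task_id,
        "jars": jars, "files": files, "properties": properties,
        "serialized_task": task_bytes.hex(),
    })

def run_new(message: bytes):
    desc = deserialize(message)
    deps = {k: desc[k] for k in DEP_KEYS}
    # Dependencies are still available before the Task is deserialized.
    task = deserialize(bytes.fromhex(desc["serialized_task"]))
    return desc["task_id"], deps, task
```

In both versions the executor side can read the JARs, files, and Properties before touching the Task bytes; the second version just gets there with one fewer serialization pass.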
## How was this patch tested?
Unit tests
This is a simpler alternative to the approach proposed in #15505.
The biggest difference in functionality from the approach there is that, in
that code, all of the serialization occurs in one place (in
CoarseGrainedExecutorBackend), whereas this approach maintains the split of
serialization (where some happens in TaskSetManager and some in
CoarseGrainedExecutorBackend) that was present in the existing code. I do
think there are some benefits of doing all of the serialization in one place
(e.g., to time it all, or to enable opportunities for parallelism in the
future) but given that we don't take advantage of any of those things
currently, it doesn't seem necessary to change (and it's simpler not to change
it).
cc @shivaram
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kayousterhout/spark-1 SPARK-17931
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16053.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16053
----
commit e3cb71ea57e135975ce3e28caaba314a9a599b86
Author: Kay Ousterhout <[email protected]>
Date: 2016-11-29T06:23:45Z
[SPARK-17931] Eliminate unnecessary task (de)serialization
----