GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/3275
[SPARK-4349] Checking if parallel collection partition is serializable
Before, the DAGScheduler would determine if a task is serializable by doing
a dry-run serialization of the first task in an array.
However, with parallel collection partitions, some partitions may be empty
and therefore trivially serializable, even though other, non-empty
partitions contain unserializable objects.
The solution presented here is a little hacky: manually serialize one
non-empty parallel collection partition, if any exists in the task set. It
would be ideal to implement a more generic solution, but that would involve
extending the partition API. If more cases arise in which some tasks in a
set are serializable and others are not, we would have no choice but to
proactively attempt to serialize all of them, which could be expensive.
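The failure mode described above can be sketched with plain Java serialization (which is what a dry-run serializability check boils down to). This is a minimal illustration, not Spark's actual DAGScheduler code: the partition lists, the `canSerialize` helper, and the "check one non-empty partition" logic are all hypothetical stand-ins for the behavior the PR describes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerializabilityCheckSketch {

    // Dry-run serialize an object; true if it serializes cleanly.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is an IOException subclass.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical parallel collection partitions: the first is empty
        // (trivially serializable), a later one holds an unserializable
        // object (a raw Thread is not Serializable).
        List<List<Object>> partitions = Arrays.asList(
            new ArrayList<>(),
            new ArrayList<>(Arrays.asList(new Thread()))
        );

        // Old-style check: dry-run serialize only the first partition.
        boolean firstOnlyOk = canSerialize(partitions.get(0));

        // Fixed-style check: also dry-run one non-empty partition, if any.
        boolean nonEmptyOk = partitions.stream()
            .filter(p -> !p.isEmpty())
            .findFirst()
            .map(SerializabilityCheckSketch::canSerialize)
            .orElse(true);

        System.out.println(firstOnlyOk); // misses the unserializable object
        System.out.println(nonEmptyOk);  // catches it
    }
}
```

Checking the first element passes because an empty `ArrayList` serializes fine; checking a non-empty partition fails fast at scheduling time instead of hanging or erroring later on the executors, which is the behavior this PR targets.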
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mccheah/spark dont-hang-serialization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3275.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3275
----
commit fafc7234ca8c6e05702bea5b15587cf9941a9c04
Author: mcheah <[email protected]>
Date: 2014-11-12T03:41:28Z
[SPARK-4349] Checking if parallel collection partition is serializable
Before, the DAGScheduler would determine if a task is serializable by
doing a dry-run serialization of the first task in an array.
However, with parallel collection partitions, some partitions may be
empty, and therefore they would be serializable, even though other
non-empty partitions contain unserializable objects.
The solution presented here is a little hacky: manually serialize one
non-empty parallel collection partition, if any exists in the task set.
It would be ideal to implement a more generic solution, but that would
involve extending the partition API. If more cases arise in which some
tasks in a set are serializable and others are not, we would have no
choice but to proactively attempt to serialize all of them, which could
be expensive.
commit 6ed8608c2934a28b4152b0f8d7e4c1ad461f1005
Author: mcheah <[email protected]>
Date: 2014-11-14T19:12:19Z
Cleaning up checking serializable parallel collection partitions.
commit e1b52728429cb2b56ff120edbe92175d65405058
Author: mcheah <[email protected]>
Date: 2014-11-14T19:12:52Z
Merge branch 'master' into dont-hang-serialization
commit 13fb7ea1ee9d1f43c3df72548fbd3189796f35f2
Author: mcheah <[email protected]>
Date: 2014-11-14T23:22:32Z
More cleanup in DAGScheduler
commit 08b66058e8ffca1a9a976a936f866191bfb1b653
Author: mcheah <[email protected]>
Date: 2014-11-14T23:23:52Z
Merge branch 'master' into dont-hang-serialization
----