GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/3275

    [SPARK-4349] Checking if parallel collection partition is serializable 

    Before, the DAGScheduler would determine whether a task set is serializable 
by doing a dry-run serialization of only the first task in the array.
    
    However, with parallel collection partitions, some partitions may be empty. 
An empty partition always serializes successfully, even when other non-empty 
partitions in the same set contain unserializable objects.
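
    For illustration, a small standalone repro (the class and object names here 
are invented for the example; it relies on parallelize slicing a two-element 
sequence into four partitions, which leaves the first partition empty):

    import org.apache.spark.{SparkConf, SparkContext}

    // Not Serializable: a task whose partition carries an instance of this
    // class cannot be serialized and shipped to an executor.
    class Unserializable

    object Spark4349Repro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("SPARK-4349-repro"))
        // Two elements sliced into four partitions leave partitions 0 and 2
        // empty, so dry-run serializing only the first task succeeds even
        // though the task set as a whole cannot be shipped.
        val rdd = sc.parallelize(Seq(new Unserializable, new Unserializable), 4)
        rdd.count() // the failure only surfaces later, at task launch
        sc.stop()
      }
    }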
    
    The solution presented here is a little hacky: manually serialize one 
non-empty parallel collection partition, if any exists in the task set. A more 
generic solution would be ideal, but that would require extending the partition 
API. If more cases arise in which some tasks in a set are serializable and 
others are not, we would have no choice but to proactively attempt to serialize 
all of them, which could be expensive.
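
    A minimal sketch of the idea, using stand-in types rather than Spark's 
private internals (ParallelCollectionPartition is not public API, so this is 
illustrative only, not the actual patch): prefer a non-empty parallel 
collection partition for the dry-run, since empty ones always serialize and 
prove nothing.

    // Stand-in types for illustration; the real change works against
    // Spark's private ParallelCollectionPartition, not these.
    trait Partition { def index: Int }

    case class ParallelCollectionPartition[T](index: Int, values: Seq[T])
      extends Partition

    // Pick the partition to dry-run serialize: a non-empty parallel
    // collection partition if one exists, otherwise fall back to the
    // first partition (assumes the task set is non-empty).
    def partitionToSerializeFirst(partitions: Seq[Partition]): Partition =
      partitions
        .collectFirst {
          case p @ ParallelCollectionPartition(_, values) if values.nonEmpty => p
        }
        .getOrElse(partitions.head)

    For example, given Seq(ParallelCollectionPartition(0, Nil), 
ParallelCollectionPartition(1, Seq(1))), the check would serialize partition 1 
rather than the trivially serializable partition 0.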

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark dont-hang-serialization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3275
    
----
commit fafc7234ca8c6e05702bea5b15587cf9941a9c04
Author: mcheah <[email protected]>
Date:   2014-11-12T03:41:28Z

    [SPARK-4349] Checking if parallel collection partition is serializable
    
    Before, the DAGScheduler would determine whether a task set is
    serializable by doing a dry-run serialization of only the first task
    in the array.
    
    However, with parallel collection partitions, some partitions may be
    empty. An empty partition always serializes successfully, even when
    other non-empty partitions in the same set contain unserializable
    objects.
    
    The solution presented here is a little hacky: manually serialize one
    non-empty parallel collection partition, if any exists in the task
    set. A more generic solution would be ideal, but that would require
    extending the partition API. If more cases arise in which some tasks
    in a set are serializable and others are not, we would have no choice
    but to proactively attempt to serialize all of them, which could be
    expensive.

commit 6ed8608c2934a28b4152b0f8d7e4c1ad461f1005
Author: mcheah <[email protected]>
Date:   2014-11-14T19:12:19Z

    Clean up the check for serializable parallel collection partitions.

commit e1b52728429cb2b56ff120edbe92175d65405058
Author: mcheah <[email protected]>
Date:   2014-11-14T19:12:52Z

    Merge branch 'master' into dont-hang-serialization

commit 13fb7ea1ee9d1f43c3df72548fbd3189796f35f2
Author: mcheah <[email protected]>
Date:   2014-11-14T23:22:32Z

    More cleanup in DAGScheduler

commit 08b66058e8ffca1a9a976a936f866191bfb1b653
Author: mcheah <[email protected]>
Date:   2014-11-14T23:23:52Z

    Merge branch 'master' into dont-hang-serialization

----

