Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/3275#issuecomment-63149863
Please consider the design issues that I think this bug uncovers before
commenting on the PR.
From what I understand, the original design was to catch task serialization
errors up front in the DAGScheduler. I'm not sure that is the right place to
handle these errors. Why shouldn't the TaskSetManager be able to handle
serialization failures correctly on its own? As it stands, we're doing an
extra, unnecessary serialization call in the DAGScheduler, which is wasted
effort because the TaskSetManager is going to serialize the tasks anyway.
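To illustrate the redundancy, here is a minimal plain-JVM sketch (not Spark's actual scheduler code; `trialSerialize` and `submitTasks` are hypothetical names) of an up-front check that trial-serializes only the first task. It adds work to the common successful case while the real serialization still happens later for every task:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializationCheckSketch {
  // Trial-serialize one object; throws NotSerializableException if it cannot be serialized.
  def trialSerialize(task: AnyRef): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(task) finally out.close()
  }

  // Hypothetical submission path: probing only tasks.head adds work to the common
  // successful case, and every task still gets serialized again afterwards.
  def submitTasks(tasks: Seq[AnyRef]): Unit = {
    trialSerialize(tasks.head)     // up-front check on a single task
    tasks.foreach(trialSerialize)  // the serialization that actually ships the tasks
  }
}
```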
This also uncovers the flawed assumption that if one task is serializable,
then all tasks in the task set are serializable. I'm not sure we should rely
on that assumption: custom RDDs and custom Partition implementations may not
be uniformly serializable across partitions. We should revisit this
assumption; see the sketch below.
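As a concrete illustration, here is a minimal sketch of a custom RDD whose partitions are not uniformly serializable. The `NonSerializableHandle`, `MixedPartition`, and `MixedSerializabilityRDD` names are made up for this example and do not come from the PR:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Stand-in for state that cannot be serialized (a socket, a file handle, ...).
class NonSerializableHandle

// Partition whose serializability depends on per-partition state.
class MixedPartition(override val index: Int,
                     val handle: Option[NonSerializableHandle]) extends Partition

// Custom RDD: partition 0 serializes fine, partition 1 does not, so checking
// only the first task of the task set would not catch the failure.
class MixedSerializabilityRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] = Array(
    new MixedPartition(0, None),                            // serializable
    new MixedPartition(1, Some(new NonSerializableHandle))  // not serializable
  )

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}
```

With something like `new MixedSerializabilityRDD(sc).collect()`, a check that serializes only the first task would pass, and the NotSerializableException would surface only when the second task is serialized.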