Github user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/186#issuecomment-38336673
I feel conflicted about this -- I wonder if it's better to just never
restart (i.e., when an exception gets thrown, end the SparkContext with an
exception). Have others seen examples of exceptions in the DAGScheduler that
we can recover from?
I filed the JIRA leading to this change because a PythonAccumulator was
throwing a SparkException (exposing this DAGScheduler issue). I originally
tried to fix the handling of this exception (see
https://spark-project.atlassian.net/browse/SPARK-1290) but the exception gets
thrown when the Python Spark Context gets disconnected from the Scala Spark
Context -- so I don't think that failure is recoverable. In other words,
for that case, I think it makes the most sense to just die, because we're never
going to make forward progress.
I'm also concerned that we can never completely clean up after an error
like this. For example, suppose the exception gets thrown at the end of
handling a task completion, after a job has already been marked as finished;
then when we cancel the job, we'll end up with a bunch of duplicate messages.
I think I'm in favor of a solution where we fail the DAGScheduler on an
exception, and then catch the recoverable exceptions in the appropriate place
lower down in the code, where we can appropriately clean up the state.
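To make the proposal concrete, here is a minimal sketch of that policy in Python (names and structure are purely illustrative, not Spark's actual code): recoverable exceptions are caught at the point where state can be cleaned up, while anything that escapes to the event loop fails the scheduler permanently instead of restarting it.

```python
# Illustrative sketch only -- not the DAGScheduler's real API.
# Policy: recover locally where cleanup is possible; fail fast otherwise.

class SchedulerStopped(Exception):
    """Raised when an event arrives after the scheduler has failed."""


class ToyScheduler:
    def __init__(self):
        self.stopped = False
        self.finished_jobs = set()

    def apply_accumulator_update(self, update):
        # Stand-in for the accumulator path that exposed the issue.
        if update is None:
            raise ValueError("malformed accumulator update")

    def handle_task_completion(self, job_id, accumulator_update):
        try:
            self.apply_accumulator_update(accumulator_update)
        except ValueError:
            # Recoverable: drop the bad update here, where we know
            # exactly what state needs cleaning up.
            pass
        self.finished_jobs.add(job_id)

    def run_event(self, event):
        if self.stopped:
            raise SchedulerStopped("scheduler already failed")
        try:
            event()
        except Exception:
            # Unrecoverable: mark the scheduler dead rather than
            # restarting with possibly-inconsistent state
            # (e.g. jobs already marked finished).
            self.stopped = True
            raise
```

The point of the sketch is the split: the `except ValueError` lives next to the state it can clean up, while the outer catch never tries to resume.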
Curious to hear what others' thoughts are here...