Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/186#issuecomment-38336673
  
    I feel conflicted about this -- I wonder if it's better to just never 
restart (i.e., when an exception gets thrown, end the SparkContext with an 
exception).  Have others seen examples of exceptions in the DAGScheduler that 
we can recover from?
    
    I filed the JIRA leading to this change because a PythonAccumulator was 
throwing a SparkException (which exposed this DAGScheduler issue).  I 
originally tried to fix the handling of that exception (see 
https://spark-project.atlassian.net/browse/SPARK-1290), but the exception gets 
thrown when the Python SparkContext gets disconnected from the Scala 
SparkContext -- so I don't think the problem can be recovered from.  In other 
words, for that case, I think it makes the most sense to just die, because 
we're never going to make forward progress.
    
    I'm also concerned that we can never completely clean up after an error 
like this.  For example, what if the exception gets thrown at the end of 
handling a task completion, after a job has already been marked as finished?  
Then, when we cancel the job, we'll end up with a bunch of duplicate messages.  
I'm in favor of a solution where we fail the DAGScheduler on an exception, 
and catch the recoverable exceptions in the appropriate place lower down in 
the code, where we can properly clean up the state.
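    To illustrate the pattern I'm describing, here is a minimal sketch (all 
names are hypothetical, not actual DAGScheduler code): known-recoverable 
exceptions are caught at the lower level that knows how to undo its own state, 
while anything unexpected propagates and fails the scheduler, rather than 
being swallowed by a blanket restart at the top:

    ```scala
    // Hypothetical sketch of "catch recoverable errors where the state is known".
    class RecoverableException(msg: String) extends Exception(msg)

    object SchedulerSketch {
      var jobsCleanedUp = 0

      // Stand-in for a lower-level handler (e.g. task-completion handling)
      // that knows exactly which state it needs to clean up on failure.
      def handleTaskCompletion(failRecoverably: Boolean): Unit = {
        try {
          if (failRecoverably) {
            throw new RecoverableException("accumulator update failed")
          }
        } catch {
          case _: RecoverableException =>
            // Recover here, where we can clean up precisely: no partially
            // finished jobs left behind, no duplicate cancellation messages.
            jobsCleanedUp += 1
        }
        // Any other exception is NOT caught here: it propagates up and
        // fails the scheduler instead of triggering a restart loop.
      }
    }
    ```

    The point of the sketch is just that the top level never restarts: it 
either sees no exception, or a fatal one.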
    
    Curious to hear what others' thoughts are here...

