Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/8180#issuecomment-134294669
  
    I’ve only recently looked at making changes to the scheduler, but it 
seems to me there is widespread agreement among committers that it is very 
error prone.  For example, consider Andrew Or’s plea in 
[SPARK-8987](https://issues.apache.org/jira/browse/SPARK-8987) which starts 
with:
    > DAGScheduler is one of the most monstrous piece of code in Spark.
    
    Other recent examples of similar sentiments are in this [discussion on 
backporting SPARK-8103](https://github.com/apache/spark/pull/7572) or confusion 
in the earlier versions of SPARK-5945, SPARK-7308, and SPARK-8103.  Even this 
[seemingly innocuous three line 
change](https://github.com/apache/spark/commit/702aa9d7fb16c98a50e046edfd76b8a7861d0391#diff-6a9ff7fb74fd490a50462d45db2d5e11R792)
 inadvertently introduced SPARK-9809 (it's really lucky that somebody stumbled 
on that before the release).
    
    I’ve been working on issues related to fault-tolerance, primarily 
SPARK-8103 & SPARK-8029, which came from real customer escalations.  Those took 
me a *long* time to wrap my head around, after painfully trying to make sense 
of user logs, create a reproduction, propose a fix, convince others there 
really was something wrong, and get lots of help to make the right fix.  
    
    I did a bit of fault-injection testing as well, and things seemed to pass 
consistently after my fixes, so I was hoping that would be the end of the 
story.  But then I dug through some existing JIRAs, and found 
[SPARK-5259](https://issues.apache.org/jira/browse/SPARK-5259).  I couldn't 
believe it had been open since January!  A community member had discovered it 
and even very clearly described exactly how it happened, but still we haven't 
fixed it.  We're about to release Spark 1.5 with it still broken, which means 
that's at least 3 releases where fault tolerance is knowingly broken.  I find 
that embarrassing.
    
    I'm not saying all of this to denigrate the effort that everyone has 
already put into it, but I just want to be clear that I really do mean it: we 
are unable to deal with the complexity of the scheduler.  IMO, the highest 
reward would be to fix the fault-tolerance issues, and focus on testing the 
scheduler so we gain more confidence in it.  So while this feature is 
interesting, I think we should proceed very cautiously.


