Github user squito commented on the pull request:
https://github.com/apache/spark/pull/8180#issuecomment-134294669
I've only recently looked at making changes to the scheduler, but it
seems to me there is widespread agreement among committers that it is very
error prone. For example, consider Andrew Or's plea in
[SPARK-8987](https://issues.apache.org/jira/browse/SPARK-8987) which starts
with:

> DAGScheduler is one of the most monstrous piece of code in Spark.

Other recent examples of similar sentiments are in this [discussion on
backporting SPARK-8103](https://github.com/apache/spark/pull/7572) or confusion
in the earlier versions of SPARK-5945, SPARK-7308, and SPARK-8103. Even this
[seemingly innocuous three line
change](https://github.com/apache/spark/commit/702aa9d7fb16c98a50e046edfd76b8a7861d0391#diff-6a9ff7fb74fd490a50462d45db2d5e11R792)
inadvertently introduced SPARK-9809 (it's really lucky that somebody stumbled
on that before the release).

I've been working on issues related to fault-tolerance, primarily
SPARK-8103 & SPARK-8029, which came from real customer escalations. Those took
me a *long* time to wrap my head around, after painfully trying to make sense
of user logs, create a reproduction, propose a fix, convince others there
really was something wrong, and get lots of help to make the right fix.

I did a bit of fault-injection testing as well, and things seemed to pass
consistently after my fixes, so I was hoping that would be the end of the
story. But then I dug through some existing JIRAs, and found
[SPARK-5259](https://issues.apache.org/jira/browse/SPARK-5259). I couldn't
believe it had been open since January! A community member had discovered it
and even very clearly described exactly how it happened, but still we haven't
fixed it. We're about to release Spark 1.5 with it still broken, which means
that's at least 3 releases where fault tolerance is knowingly broken. I find
that embarrassing.

I'm not saying all of this to denigrate the effort that everyone has
already put into it, but I just want to be clear that I really do mean it: we
are unable to deal with the complexity of the scheduler. IMO, the highest
reward would be to fix the fault-tolerance issues, and focus on testing the
scheduler so we gain more confidence in it. So while this feature is
interesting, I think we should proceed very cautiously.