Github user squito commented on the pull request:
https://github.com/apache/spark/pull/7028#issuecomment-117692804
Hi @aarondav @andrewor14, I would like to discuss this a little more. I think I didn't understand your proposal earlier and didn't really express my concerns properly -- sorry, I am going to backpedal on what I said before.
My concern is not so much about the implementation details as about how this looks to the end user. Certainly this change is better than the status quo. But I'd rather we add this the right way the first time than put something in now, make users learn this new exception format, and then update it down the line so users have to learn yet another format. Understanding exceptions from Spark is already a big source of user confusion imo, so it's worth spending a bit of time on this.
There are 3 stack traces here: (a) from the executor where the original error occurred; (b) where the exception was handled in `DAGScheduler.handleTaskSetFailed` as part of the event loop; and (c) where the user code triggered the job with an action like `rdd.count()`. (b) is probably totally useless for most users (it's just occasionally useful for a Spark developer), but right now we give the user (a) + (b).
This PR puts (b) + (c) together as if they were one stack trace. I definitely think that is an improvement: users get to see (c), and most of them will just ignore (b) anyway, so in the common case it doesn't matter. But there will definitely be times when a curious user tries to understand the rest of the stack trace. Maybe they hit some Spark bug and want to understand it more before filing a JIRA, or they think that perhaps they are misusing Spark and the stack trace will help them understand better, etc. In those cases, I think it'll cause a lot of confusion to have it appear that it's all one stack trace.
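To make it concrete, here is a tiny standalone sketch (hypothetical names only, not the actual code in this PR) of what I mean by stitching: the scheduler-side trace (b) gets spliced together with the user call-site trace (c) so that it prints as one continuous stack trace.
```scala
// Standalone illustration only -- hypothetical names, not Spark code from this PR.
object StitchedTraceSketch {
  def main(args: Array[String]): Unit = {
    // (c): the user call site, captured eagerly when the action (e.g. rdd.count()) is submitted
    val userCallSite: Array[StackTraceElement] = Thread.currentThread.getStackTrace

    // (a) + (b): the exception as it surfaces from the scheduler's event loop
    val schedulerError = new RuntimeException(
      "Job aborted due to stage failure: ... java.lang.RuntimeException: uh-oh!")

    // "Stitching": append (c) onto (b) so the result reads as a single call stack
    val stitched = new RuntimeException(schedulerError.getMessage)
    stitched.setStackTrace(schedulerError.getStackTrace ++ userCallSite)

    // Prints one continuous trace; a reader can't tell where (b) ends and (c) begins
    stitched.printStackTrace()
  }
}
```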
If we did keep them separated as three different stack traces, it would be much easier for a user to understand what happened in those cases, and I think they'd also be much more likely to look for an explanation of what the different parts are. (As I said earlier, if it looks like one simple stack trace, I wouldn't think I needed to consult any documentation to understand it.)
I put together an alternative implementation here: https://github.com/apache/spark/pull/7156 (sadly the tests don't run because the Scala compiler crashes while running MiMa ... but it's just meant for discussion, not to actually merge in any case, so just pretend the tests pass ...). That is an alternate way to get all the stack traces: it makes (a) + (b) the cause of (c), so you get a stack trace like:
```
org.apache.spark.SparkException: job failed
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1741)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1759)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1774)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1095)
  at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38.apply$mcJ$sp(DAGSchedulerSuite.scala:883)
  ...[snip]
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: uh-oh!
  at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:883)
  at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:883)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1627)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1774)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1774)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
  at org.apache.spark.scheduler.Task.run(Task.scala:70)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1285)
  ...[snip]
```
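For contrast, that approach boils down to ordinary exception chaining: the driver throws a fresh exception whose own stack trace is the user call site (c), with the scheduler-side failure (a) + (b) attached as the cause, so the two print as clearly separate blocks joined by `Caused by:`. A minimal sketch (again with hypothetical names, not the actual patch in #7156):
```scala
// Standalone illustration only -- hypothetical names, not the actual code in #7156.
object ChainedTraceSketch {
  def main(args: Array[String]): Unit = {
    // (a) + (b): the failure as it arrives from the scheduler's event loop
    val schedulerError = new RuntimeException(
      "Job aborted due to stage failure: ... java.lang.RuntimeException: uh-oh!")

    // (c): a fresh exception thrown at the user call site, with (a) + (b) as its cause
    val jobFailed = new RuntimeException("job failed", schedulerError)

    // Prints (c) first, then "Caused by: ..." with (a) + (b) as a separate block
    jobFailed.printStackTrace()
  }
}
```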
I don't love that version either -- we're still doing a bit of magic that could confuse somebody who looks too closely. But I do think a separate causing exception is at least a cleaner break than something that just looks like part of the call stack. I'd really prefer it if we could keep all three separate.
Sorry to backtrack on my earlier comment; I don't mean to create an unnecessary obstacle here. I do think the patch as-is is still an improvement -- I suppose the curious user could just search the code for `==== Job Submission ====`, and we could add a clear comment in the code explaining what is going on. I haven't really thought about the implementation in any great detail, so maybe it isn't worth doing something more complicated. (Also, I really haven't thought at all about what is going on with streaming -- perhaps that also makes a bigger change impossible.)
Thanks for working on this and taking the time to discuss with me and read through my rambling :)