GitHub user squito commented on the issue:
https://github.com/apache/spark/pull/13603
@kayousterhout sure, I'll pull the visibility stuff out.
I did consider doing the check on task failure instead. However, I don't
think that is sufficient, because an executor can also fail. Imagine task 1 is
on executor A and task 2 is on executor B. Task 1 fails and gets blacklisted
from executor A -- but it can still be scheduled on executor B, so you don't
fail the stage. Then executor B dies. Task 2 can run on executor A, so it
isn't stuck. But task 1 now can't run anywhere.
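To make that concrete, here's a rough sketch of the check I have in mind. The names and data structures (`BlacklistCheckSketch`, `findUnschedulableTask`, `isBlacklisted`) are made up for illustration, not the actual TaskSetManager internals:

```scala
// Rough sketch only -- illustrative names, simulated state.
object BlacklistCheckSketch {
  /** Returns the first pending task that no live executor can run. */
  def findUnschedulableTask(
      pendingTasks: Seq[Int],
      liveExecutors: Set[String],
      isBlacklisted: (Int, String) => Boolean): Option[Int] =
    pendingTasks.find { task =>
      // A task is stuck if every live executor has blacklisted it.
      liveExecutors.forall(exec => isBlacklisted(task, exec))
    }

  def main(args: Array[String]): Unit = {
    // The scenario above: task 1 was blacklisted on A, and B has died,
    // so A is the only executor left -- task 1 has nowhere to run.
    val blacklist = Set((1, "A"))
    val stuck = findUnschedulableTask(
      pendingTasks = Seq(1, 2),
      liveExecutors = Set("A"),
      isBlacklisted = (t, e) => blacklist((t, e)))
    assert(stuck == Some(1)) // better to abort the stage than hang forever
  }
}
```

The real version would of course hook into the scheduler's actual pending-task and blacklist state rather than taking them as parameters.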
That scenario is probably unlikely, but having the job just hang is so bad
that I think we really should avoid it. Plus it becomes much more likely with
the new blacklisting I'm working on: in that case, executor B gets blacklisted
for the bad stage because of many task failures, and then there is no place
left for the first failed task to run. I actually ran into that case when
testing an early iteration of that change.
This is subtle enough that it's probably worth codifying into a test -- I'll
work on adding that; a rough outline of what I mean is below.
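Something along these lines, reusing the sketch above -- plain ScalaTest with simulated scheduler state, so just the shape of the test, not Spark's actual test harness:

```scala
import org.scalatest.funsuite.AnyFunSuite

class CompletelyBlacklistedSuite extends AnyFunSuite {
  test("abort when a failed task is blacklisted on every live executor") {
    var liveExecutors = Set("A", "B")
    var blacklist = Set.empty[(Int, String)]

    // Task 1 fails on executor A and gets blacklisted there.
    blacklist += ((1, "A"))
    // It can still run on B, so nothing should be flagged yet.
    assert(BlacklistCheckSketch.findUnschedulableTask(
      Seq(1, 2), liveExecutors, (t, e) => blacklist((t, e))).isEmpty)

    // Executor B dies; now task 1 has nowhere left to run.
    liveExecutors -= "B"
    assert(BlacklistCheckSketch.findUnschedulableTask(
      Seq(1, 2), liveExecutors, (t, e) => blacklist((t, e))) == Some(1))
  }
}
```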
(I agree with you that it's OK to fail the task set even if a new executor
is just about to launch. Even this version doesn't really avoid that race.)