Github user squito commented on the issue:
https://github.com/apache/spark/pull/22288
OK, I looked at the JIRAs, and it looks like this also covers SPARK-15815,
right? You could add that to the summary too.
You mention some future improvements:
> Taking into account static allocation
I mentioned this in an inline comment too, but now that I'm thinking about it,
it seems like this will be fine with static allocation as well. The problem
just seems worst with dynamic allocation (DA), as you can end up with one
executor left for the straggler task, and then that executor gets blacklisted.
But with static allocation, maybe you only requested a small number of
executors on a large cluster, and by chance you get them all on a host with bad
disks, so then everything starts failing. You could still just kill those
executors and request new ones to keep things going. Anything I'm missing?
> Querying the RM to figure out if its a small cluster, then try to wait
some more or abort immediately.
What's the concern here -- that if you're on a small cluster, there is very
little chance of getting a good replacement, so you should go back to failing
fast? I guess that would be nice, but it's much less important in my opinion.
> Try to distinguish between waiting for time while you acquire an executor
and time for being unable to schedule a task.
I don't understand this part -- do you mean for locality preferences?