Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
The question to me comes down to how many temporary resource issues you expect
and how often. At some point, if it's just from that much skew, you should
probably fix your configs; whether you succeed or not before you hit the max
task failures (based on other tasks finishing on the executor) is mostly luck.
If it's a transient network issue, a retry could work; if it's a long-lived
network issue, I expect a lot of things to fail.
Unfortunately I don't have enough data on how often this happens, or on what
exactly happens in Spark, to know what would definitely help. I know on MR that
allowing multiple attempts on the same node sometimes works for at least some
of these temporary conditions. I would think the same applies to Spark,
although as mentioned the difference might be the container re-launch time
versus just sending a task retry.
Perhaps we should just change the defaults to allow
more than one task attempt per executor and accordingly increase
maxTaskAttemptsPerNode. Then let's run with it and see what happens, so we can
get some more data and enhance from there. If we want to be paranoid, leave the
existing blacklisting functionality in as a fallback in case the new behavior
doesn't work.
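As a rough illustration of what those looser defaults might look like (a
sketch only: the config keys assume the spark.blacklist.* names proposed in
this patch, and the numbers are placeholders, not tested recommendations):

```scala
import org.apache.spark.SparkConf

// Sketch of looser blacklisting defaults: allow a second attempt on the same
// executor before blacklisting it, and bump the per-node limit to match, so a
// transient resource or network blip doesn't immediately take an executor or
// node out of rotation. Key names assume the spark.blacklist.* scheme from
// this PR; values are illustrative only.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2") // instead of 1
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "3")     // raised accordingly
  .set("spark.task.maxFailures", "4")                          // overall cap still bounds retries
```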
Also, if we are worried about resource limits, i.e. we blacklist too many
executors/nodes for too long, then perhaps we should add a failsafe like a max
% blacklisted: once you hit that percentage, you don't blacklist anything more.
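Something like the following is the shape of the check I have in mind (purely
a sketch; BlacklistCap, maxBlacklistedFraction, and shouldBlacklistExecutor are
hypothetical names, not anything in the current patch):

```scala
// Hypothetical failsafe: refuse to blacklist another executor once doing so
// would push the blacklisted fraction of the cluster past a configured cap.
class BlacklistCap(maxBlacklistedFraction: Double) {

  /** Returns true only if blacklisting one more executor stays within the cap. */
  def shouldBlacklistExecutor(numBlacklisted: Int, totalExecutors: Int): Boolean = {
    if (totalExecutors <= 0) {
      false
    } else {
      val fractionAfter = (numBlacklisted + 1).toDouble / totalExecutors
      fractionAfter <= maxBlacklistedFraction
    }
  }
}

// Example: with a 50% cap on a 10-executor cluster, the 6th blacklist is refused.
val cap = new BlacklistCap(maxBlacklistedFraction = 0.5)
assert(cap.shouldBlacklistExecutor(numBlacklisted = 4, totalExecutors = 10))  // 5/10 <= 0.5
assert(!cap.shouldBlacklistExecutor(numBlacklisted = 5, totalExecutors = 10)) // 6/10 > 0.5
```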