Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
Sorry, I haven't followed this PR since it was split off the main one. My response may be a bit fragmented since it addresses several different points; if something doesn't make sense, let me know.
> b) A few seconds to 10's of seconds is usually enough if the problem is
due to memory or disk pressures.
Seems this would vary a lot. I've seen resource pressures last for hours
rather than seconds, e.g. another application pegging all the disks on the node
while it's doing a huge shuffle. I'm not really sure on the memory side; I
assume it was a skewed task, and in this case other tasks finished and you just
happened to have enough memory to finish now? If that's the case, why not just
run it on another executor or node anyway? The odds seem about the same. I
guess if you had the locality wait high enough it might not try another
executor first, or if you had a small enough number of executors it could be an
issue.
Really the temporary-resource case falls into what I was talking about
in the design with allowing more than 1 task attempt failure per executor
(which is why I wanted it configurable). We have seen this on MapReduce too,
but on MR you generally get a few seconds of grace anyway because it has to
relaunch an entire JVM. So one option that seems equivalent to the prior
blacklisting would be to set spark.blacklist.task.maxTaskAttemptsPerExecutor > 1
and add an additional timeout between attempts, which would be basically the
same as spark.scheduler.executorTaskBlacklistTime. Thoughts?
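As a rough sketch of the combination I'm suggesting (the first property exists in this PR; the retry-interval property is hypothetical, just illustrating how the old executorTaskBlacklistTime behavior could come back):

```
# Allow up to 2 failed attempts of the same task on one executor
# before blacklisting that executor for the task (setting from this PR).
spark.blacklist.task.maxTaskAttemptsPerExecutor   2

# Hypothetical new setting: minimum delay before the same task may be
# retried on an executor where it already failed, mirroring the old
# spark.scheduler.executorTaskBlacklistTime semantics.
spark.blacklist.task.retryIntervalMs              15000
```

That way a transient resource pressure has time to clear before the second attempt lands on the same executor.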
I think blacklisting the executor is OK, especially in cases where you have a
bad disk, because YARN should handle some of these cases for you: if you
create another executor on that node, it could get a different list of
disks that leaves out the bad one. It could also have been a transient type of
issue, or the resource one mridul mentioned.