Josh Rosen created SPARK-3289:
---------------------------------
Summary: Prevent complete job failures due to rescheduling of
failing tasks on buggy machines
Key: SPARK-3289
URL: https://issues.apache.org/jira/browse/SPARK-3289
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Josh Rosen
Some users have reported issues where a task fails due to an environment or
configuration problem on one machine, and the task is then reattempted _on that
same buggy machine_ until the entire job fails because that single task has
failed too many times.
To guard against this, maybe we should add some randomization in how we
reschedule failed tasks.
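
As a rough illustration of the idea, here is a minimal sketch in Python. It is
not Spark's scheduler code. The helper name pick_retry_host and its parameters
are hypothetical, and it assumes the scheduler knows which hosts a task has
already failed on. It also goes slightly beyond plain randomization by
skipping those hosts whenever another host is available.

```python
import random
from typing import Optional, Sequence, Set


def pick_retry_host(
    hosts: Sequence[str],
    failed_hosts: Set[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Choose a host for re-running a failed task (hypothetical helper).

    Hosts where the task has already failed are avoided when any other
    host is available; the choice among the remaining hosts is random.
    """
    if not hosts:
        raise ValueError("no hosts available")
    rng = rng or random.Random()
    # Prefer hosts that have not yet failed this task.
    candidates = [h for h in hosts if h not in failed_hosts]
    # If every host has already failed the task, fall back to the full
    # list rather than refusing to schedule at all.
    if not candidates:
        candidates = list(hosts)
    return rng.choice(candidates)


# Example: the task already failed on host-b, so the retry goes to
# host-a or host-c, chosen at random.
retry_host = pick_retry_host(["host-a", "host-b", "host-c"], {"host-b"})
```

With a random choice alone, a retry could still land on the bad machine, only
less often. Excluding hosts with a recorded failure is what prevents the same
machine from using up all of the task's attempts.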
--
This message was sent by Atlassian JIRA
(v6.2#6252)