Tao Jie created MAPREDUCE-7110:
----------------------------------
Summary: Support delayed retry for MR task attempts
Key: MAPREDUCE-7110
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 3.1.0, 2.8.2
Reporter: Tao Jie
Assignee: Tao Jie
Today, when a map/reduce task fails, it is retried until it succeeds, up to 4
attempts by default.
In our production cluster, datanodes may be offline for a while. In a map task,
when the 3 datanodes holding replicas of the accessed block go offline at the
same time, the map attempt fails. However, in the current logic the appmaster
launches the retry attempts immediately, and the retries will very likely fail
again if those datanodes do not recover soon. As a result, the whole job
fails, even if it has been running for several hours.
In such a situation, we could have a delayed retry mechanism. For example, the
first retry could launch immediately, the second retry could wait for 10s, and
the third retry could wait longer.
This could be an option, especially useful for jobs that run for a long time,
and it would not change the current default behavior.
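
To make the idea concrete, here is a minimal sketch of how the wait before
each retry might be computed. It is not existing Hadoop code: the class
RetryDelaySketch, the method delayMs, and its parameters are hypothetical
names, and doubling the wait after the second retry (with a cap) is only one
assumed reading of "wait longer".

```java
/**
 * Hypothetical helper, not part of Hadoop: computes how long to wait
 * before launching a given retry of a failed task attempt.
 */
final class RetryDelaySketch {
    private RetryDelaySketch() {}

    /**
     * @param retryNumber 1 for the first retry, 2 for the second, and so on
     * @param baseDelayMs wait before the second retry (10s in the example above)
     * @param maxDelayMs  upper bound on any single wait (an assumed safeguard)
     * @return the wait in milliseconds before launching this retry
     */
    static long delayMs(int retryNumber, long baseDelayMs, long maxDelayMs) {
        if (retryNumber < 1 || baseDelayMs < 0 || maxDelayMs < 0) {
            throw new IllegalArgumentException("invalid retry parameters");
        }
        if (retryNumber == 1) {
            return 0L; // the first retry is launched immediately
        }
        long delay = baseDelayMs;
        // Assumption: each retry after the second doubles the previous wait.
        for (int i = 2; i < retryNumber && delay < maxDelayMs; i++) {
            // Jump straight to the cap when doubling would exceed it,
            // which also avoids overflowing the long.
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
        return Math.min(delay, maxDelayMs);
    }
}
```

With a 10s base and a 60s cap, delayMs(1, 10000, 60000) gives 0,
delayMs(2, 10000, 60000) gives 10000, and delayMs(3, 10000, 60000) gives 20000.
Setting the base to 0 keeps every retry immediate, which would match the
current behavior.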
Does this make sense? Any thoughts?