[
https://issues.apache.org/jira/browse/MAPREDUCE-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517779#comment-16517779
]
Tao Jie edited comment on MAPREDUCE-7110 at 6/20/18 4:14 AM:
-------------------------------------------------------------
Add a new parameter {{mapreduce.job.task.delayed.retry.factor.ms}}. If it is
{{0}} (the default), task retries are not delayed, which matches the current
logic. When it is set to a positive value (e.g. 5000), the first retry starts
immediately, the second retry is delayed by 5000ms, the third by 2 * 5000ms,
the next by 4 * 5000ms, and so on.
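The schedule described above can be sketched as a small function. This is only
an illustration of the arithmetic, not the code in the attached patches; the
name retry_delay_ms and the convention that retry number 1 is the first retry
are assumptions made for this example.

```python
def retry_delay_ms(retry_number: int, factor_ms: int) -> int:
    """Return the delay in ms before the given retry of a task.

    retry_number is 1 for the first retry, 2 for the second, and so on
    (a numbering assumed for this sketch). factor_ms stands in for the
    value of mapreduce.job.task.delayed.retry.factor.ms.
    """
    if retry_number < 1:
        raise ValueError("retry_number must be >= 1")
    if factor_ms < 0:
        raise ValueError("factor_ms must be >= 0")
    # A factor of 0 disables the delay; the first retry is never delayed.
    if factor_ms == 0 or retry_number == 1:
        return 0
    # Second retry waits factor_ms, and each later retry doubles the wait.
    return factor_ms * 2 ** (retry_number - 2)


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"retry {n}: {retry_delay_ms(n, 5000)} ms")
```

With a factor of 5000 the first four retries wait 0, 5000, 10000 and 20000 ms
respectively, and with a factor of 0 every retry starts immediately.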
> Support delayed retry for MR task attempts
> ------------------------------------------
>
> Key: MAPREDUCE-7110
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 2.8.2, 3.1.0
> Reporter: Tao Jie
> Assignee: Tao Jie
> Priority: Major
> Attachments: MAPREDUCE-7110.001.patch, MAPREDUCE-7110.002.patch
>
>
> Today, when a map/reduce task fails, it is retried up to 4 times by default
> until it succeeds.
> In our production cluster, datanodes may be offline for a while. In a map
> task, when the 3 datanodes holding the replicas of the accessed block go
> offline at the same time, the map attempt fails. However, with the current
> logic the appmaster launches the retry attempts immediately, and the retries
> will very likely fail again if those datanodes do not recover quickly. As a
> result, the job fails even if it has been running for several hours.
> In such a situation, we could have a delayed retry mechanism. For example,
> the first retry could start immediately, the second retry could wait for
> 10s, and the third retry could wait longer.
> It could be an option, especially for jobs that run for a long time, and it
> would not modify the current logic by default.
> Does it make sense? Any thoughts?