[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517779#comment-16517779
 ] 

Tao Jie edited comment on MAPREDUCE-7110 at 6/20/18 4:14 AM:
-------------------------------------------------------------

Add a new parameter {{mapreduce.job.task.delayed.retry.factor.ms}}. If it is 
{{0}} (the default), task retries are not delayed, as in the current logic. 
When it is set to a positive value (e.g. 5000), the first retry starts 
immediately, the second retry is delayed by 5000ms, the third by 2 * 5000ms, 
the next by 4 * 5000ms, and so on. 
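
The delay schedule described above can be sketched as follows. This is only an illustration and not the code in the attached patches: the class name {{DelayedRetrySketch}}, the method {{retryDelayMs}}, the 1-based retry numbering and the cap on the exponent are all assumptions made for the example.

```java
// Illustrative sketch only; names and the exponent cap are hypothetical,
// not taken from MAPREDUCE-7110.001.patch or MAPREDUCE-7110.002.patch.
final class DelayedRetrySketch {

    // Assumed upper bound on the doubling exponent, to avoid overflowing a long.
    private static final int MAX_SHIFT = 30;

    /**
     * Returns the delay in milliseconds before a retry.
     *
     * @param retryNumber 1 for the first retry, 2 for the second, and so on
     * @param factorMs    value of the delay factor; 0 disables delays
     */
    static long retryDelayMs(int retryNumber, long factorMs) {
        // Factor 0 (the default) keeps the current behaviour: no delay.
        // The first retry always starts immediately.
        if (factorMs <= 0 || retryNumber <= 1) {
            return 0L;
        }
        // Second retry: factor * 1, third: factor * 2, fourth: factor * 4, ...
        int shift = Math.min(retryNumber - 2, MAX_SHIFT);
        return factorMs * (1L << shift);
    }

    public static void main(String[] args) {
        long factorMs = 5000L;
        for (int retry = 1; retry <= 4; retry++) {
            System.out.println("retry " + retry + ": delay "
                + retryDelayMs(retry, factorMs) + " ms");
        }
    }
}
```

With a factor of 5000 the four retries wait 0, 5000, 10000 and 20000 ms respectively; with a factor of 0 every retry starts immediately.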


was (Author: tao jie):
Add a new parameter {mapreduce.job.task.delayed.retry.factor.ms}. If it is 
{0} (the default), task retries are not delayed, as in the current logic. 
When it is set to a positive value (e.g. 5000), the first retry starts 
immediately, the second retry is delayed by 5000ms, the third by 2 * 5000ms, 
the next by 4 * 5000ms, and so on. 

> Support delayed retry for MR task attempts
> ------------------------------------------
>
>                 Key: MAPREDUCE-7110
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.8.2, 3.1.0
>            Reporter: Tao Jie
>            Assignee: Tao Jie
>            Priority: Major
>         Attachments: MAPREDUCE-7110.001.patch, MAPREDUCE-7110.002.patch
>
>
> Today, when a map/reduce task fails, it is retried up to 4 times by default 
> until it succeeds.
> In our production cluster, datanodes may be offline for a while. In a map 
> task, when the 3 datanodes holding the replicas of the accessed block go 
> offline at the same time, the map attempt fails. However, in the current 
> logic the appmaster launches the retry attempts immediately, and the retries 
> will very likely fail again if those datanodes do not recover soon. As a 
> result, the job fails even if it has been running for several hours.
> In such a situation, we could have a delayed retry mechanism. For example, 
> the first retry could start immediately, the second retry could wait for 
> 10s, and the third retry could wait longer.
> It could be an option, especially for jobs that run for a long time, and 
> would not modify the current logic by default. 
> Does it make sense? Any thoughts?



