[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Jie updated MAPREDUCE-7110:
-------------------------------
    Attachment: MAPREDUCE-7110.001.patch

> Support delayed retry for MR task attempts
> ------------------------------------------
>
>                 Key: MAPREDUCE-7110
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.8.2, 3.1.0
>            Reporter: Tao Jie
>            Assignee: Tao Jie
>            Priority: Major
>         Attachments: MAPREDUCE-7110.001.patch
>
>
> Today, when a map/reduce task fails, it is retried up to 4 times by default
> until it succeeds.
> In our production cluster, datanodes may be offline for a while. In a map
> task, when the 3 datanodes holding replicas of the accessed block go offline
> at the same time, the map attempt fails. However, in the current logic the
> appmaster launches the retry attempts immediately, and the retries will very
> likely fail again if those datanodes do not recover soon. As a result, the
> job fails even if it has been running for several hours.
> In such a situation, we could have a delayed retry mechanism. For example,
> the first retry could start immediately, the second retry could wait for
> 10s, and the third retry could wait longer.
> It could be an option, especially for jobs that run for a long time, and it
> would not modify the current logic by default.
> Does it make sense? Any thoughts?
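
The delay schedule described above (first retry immediate, second retry after
10s, later retries waiting longer) could be sketched roughly as follows. This
is only an illustration, not the logic in the attached patch: the class name
DelayedRetryPolicy, the method delayBeforeRetryMs, the doubling rule and the
upper bound are all assumptions made for the example, and none of them are
Hadoop APIs or configuration keys.

```java
/**
 * Hypothetical sketch of a delayed-retry schedule for task attempts.
 * All names here are illustrative; they do not come from Hadoop or
 * from MAPREDUCE-7110.001.patch.
 */
final class DelayedRetryPolicy {
    private final long baseDelayMs; // delay before the second retry
    private final long maxDelayMs;  // assumed cap on any single delay

    DelayedRetryPolicy(long baseDelayMs, long maxDelayMs) {
        if (baseDelayMs < 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException(
                "require 0 <= baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /**
     * Returns how long to wait before launching the given retry.
     *
     * @param retryNumber 1 for the first retry, 2 for the second, and so on
     */
    long delayBeforeRetryMs(int retryNumber) {
        if (retryNumber < 1) {
            throw new IllegalArgumentException("retryNumber must be >= 1");
        }
        if (retryNumber == 1) {
            return 0L; // first retry is launched immediately
        }
        long delay = baseDelayMs; // second retry waits the base delay
        // Assumption: each later retry doubles the previous delay, up to the cap.
        for (int i = 2; i < retryNumber && delay < maxDelayMs; i++) {
            // Compare against half the cap so the doubling cannot overflow.
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
        return delay;
    }
}
```

With a base delay of 10s and an assumed cap of 60s, this schedule waits 0s,
10s, 20s, 40s and then 60s before successive retries. Capping the delay keeps
a long-running job from stalling indefinitely on a single task, while the
growing gap gives offline datanodes time to come back.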



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
