[
https://issues.apache.org/jira/browse/MAPREDUCE-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Jie updated MAPREDUCE-7110:
-------------------------------
Attachment: MAPREDUCE-7110.003.patch
> Support delayed retry for MR task attempts
> ------------------------------------------
>
> Key: MAPREDUCE-7110
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 2.8.2, 3.1.0
> Reporter: Tao Jie
> Assignee: Tao Jie
> Priority: Major
> Attachments: MAPREDUCE-7110.001.patch, MAPREDUCE-7110.002.patch,
> MAPREDUCE-7110.003.patch
>
>
> Today, when a map/reduce task attempt fails, the task is retried until it
> succeeds, up to 4 attempts by default.
> In our production cluster, datanodes may be offline for a while. In a map
> task, when all 3 datanodes holding replicas of the accessed block go offline
> at the same time, the map attempt fails. However, with the current logic the
> appmaster launches the retry attempts immediately, and the retries will very
> likely fail again if those datanodes do not recover quickly. As a result, the
> job can fail even if it has already been running for several hours.
> In such a situation, we could have a delayed retry mechanism. For example,
> the first retry could start immediately, the second retry could wait for
> 10s, and the third retry could wait longer.
> This could be an option, especially useful for long-running jobs, and it
> would not change the current behavior by default.
> Does this make sense? Any thoughts?
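
The delay schedule described above (first retry immediate, second retry after
10s, later retries waiting longer) could be sketched roughly as follows. This is
only an illustration and is not taken from the attached patches: the class name
DelayedRetrySketch, the method retryDelayMs, its parameters, the doubling
policy, and the 60s cap are all assumptions made for the example.

```java
// Hypothetical sketch of a delayed-retry schedule; not code from the patch.
final class DelayedRetrySketch {
    private DelayedRetrySketch() {}

    /**
     * Returns how long to wait before launching a retry attempt.
     *
     * @param retryIndex  1 for the first retry, 2 for the second, and so on
     * @param baseDelayMs delay before the second retry (assumed 10s in the example)
     * @param maxDelayMs  upper bound on any delay (an assumption, not in the issue)
     */
    static long retryDelayMs(int retryIndex, long baseDelayMs, long maxDelayMs) {
        if (retryIndex < 1 || baseDelayMs < 0 || maxDelayMs < 0) {
            throw new IllegalArgumentException("invalid retry parameters");
        }
        if (retryIndex == 1) {
            return 0L; // first retry starts immediately, as in the current logic
        }
        long delay = baseDelayMs;
        // Double the delay for each retry after the second (assumed policy).
        for (int i = 2; i < retryIndex; i++) {
            if (delay > maxDelayMs / 2) {
                return maxDelayMs; // doubling would exceed the cap (also avoids overflow)
            }
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    public static void main(String[] args) {
        // Print the delays for the three retries mentioned in the description.
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry " + retry + " delay ms: "
                + retryDelayMs(retry, 10_000L, 60_000L));
        }
    }
}
```

Keeping the first retry immediate preserves today's behavior for transient
failures, while the cap keeps a long-running job from waiting indefinitely
between attempts.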
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)