Tao Jie created MAPREDUCE-7110:
----------------------------------

             Summary: Support delayed retry for MR task attempts
                 Key: MAPREDUCE-7110
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7110
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 3.1.0, 2.8.2
            Reporter: Tao Jie
            Assignee: Tao Jie


Today, when a map/reduce task fails, it is retried up to 4 times by default 
until it succeeds.
In our production cluster, datanodes may go offline for a while. If the 3 
datanodes holding the replicas of a block that a map task reads go offline at 
the same time, that map attempt fails. With the current logic, however, the 
appmaster launches the retry attempts immediately, and those retries will very 
likely fail again unless the datanodes recover quickly. As a result, the job 
fails even if it has already been running for several hours.
In such a situation, a delayed retry mechanism would help. For example, the 
first retry could start immediately, the second retry could wait for 10s, and 
the third retry could wait longer.
This could be offered as an option, especially for long-running jobs, and it 
would not change the current default behavior.
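
To make the proposed schedule concrete, here is a minimal sketch of how the 
delay before each retry might be computed. The class name, method name, and 
parameters are hypothetical and are not existing Hadoop APIs. Doubling the 
delay after the second retry is only one assumed reading of "wait longer", 
and the cap is an added assumption. A base delay of 0 disables the feature, 
so every retry starts immediately.

```java
/** Hypothetical helper for illustration only; not part of Hadoop. */
final class DelayedRetryPolicy {
    private DelayedRetryPolicy() {}

    /**
     * Returns how long to wait before launching a retry attempt.
     *
     * @param retryNumber 1 for the first retry after the initial failure,
     *                    2 for the second retry, and so on
     * @param baseDelayMs delay before the second retry (e.g. 10000 for 10s);
     *                    a value <= 0 disables delayed retry
     * @param maxDelayMs  assumed upper bound on any single delay
     */
    static long retryDelayMs(int retryNumber, long baseDelayMs, long maxDelayMs) {
        // Feature disabled, or first retry: launch immediately.
        if (baseDelayMs <= 0 || maxDelayMs <= 0 || retryNumber <= 1) {
            return 0L;
        }
        long delay = baseDelayMs;
        // Assumption: double the delay for each retry after the second one.
        for (int i = 2; i < retryNumber; i++) {
            if (delay > maxDelayMs / 2) {
                return maxDelayMs; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }
}
```

With a 10s base delay and a 60s cap, this gives 0s, 10s, 20s, and 40s for 
retries 1 through 4, and 60s for every retry after that.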
Does this make sense? Any thoughts?




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

