Sangjin Lee created MAPREDUCE-5817:
--------------------------------------

             Summary: mappers get rescheduled on node transition even after all 
reducers are completed
                 Key: MAPREDUCE-5817
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster
    Affects Versions: 2.3.0
            Reporter: Sangjin Lee


We're seeing jobs that keep running long after all of their reducers have 
finished. We found that the job was rescheduling and running a number of 
mappers beyond the point of reducer completion. In one situation, the job ran 
for some 9 more hours after all reducers completed!

This happens because whenever a node transition (to an unusable state) reaches 
the app master, it unconditionally reschedules all mappers that already ran on 
that node.

Therefore, any node transition has the potential to extend the job's running 
time. Once this window opens, another node transition can prolong it, and in 
theory this can happen indefinitely.

If the pool is unstable for some time (nodes going unhealthy, etc.), then any 
big job is severely vulnerable to this problem.

If all reducers have completed, JobImpl.actOnUnusableNode() should not 
reschedule mapper tasks. At that point the mapper outputs are no longer needed, 
so rerunning the mappers only produces output that would never be consumed.
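
The proposed check could look roughly like the sketch below. It is a 
simplified, standalone model and not the actual JobImpl code: the class 
UnusableNodeSketch, the CompletedMap type, and the mapsToReschedule() helper 
are hypothetical names made up for illustration, and the reducer counts are 
passed in as plain integers rather than read from real job state.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the app master's decision; NOT the real JobImpl.
// All names in this class are hypothetical and exist only for illustration.
class UnusableNodeSketch {

    /** A map task that already completed successfully on a given node. */
    static final class CompletedMap {
        final String taskId;
        final String nodeId;

        CompletedMap(String taskId, String nodeId) {
            this.taskId = taskId;
            this.nodeId = nodeId;
        }
    }

    /**
     * Returns the ids of completed map tasks that should be rerun when
     * nodeId transitions to an unusable state.
     */
    static List<String> mapsToReschedule(List<CompletedMap> completedMaps,
                                         String nodeId,
                                         int totalReduces,
                                         int completedReduces) {
        List<String> toRerun = new ArrayList<>();

        // Proposed guard: once every reducer has completed, no one will
        // fetch map output again, so nothing needs to be rescheduled.
        // Assumption: a job with zero reducers also takes this branch;
        // the report does not say how map-only jobs should be treated.
        if (completedReduces == totalReduces) {
            return toRerun;
        }

        // Behavior described in the report: rerun every map that already
        // ran on the node that just became unusable.
        for (CompletedMap m : completedMaps) {
            if (m.nodeId.equals(nodeId)) {
                toRerun.add(m.taskId);
            }
        }
        return toRerun;
    }
}
```

With reducers still outstanding, the maps that ran on the lost node are 
returned for rescheduling as before; once completedReduces equals 
totalReduces, the list is empty and a node transition no longer extends the 
job.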



--
This message was sent by Atlassian JIRA
(v6.2#6252)
