Jason Lowe created MAPREDUCE-5043:
-------------------------------------

             Summary: Fetch failure processing can cause AM event queue to 
backup and eventually OOM
                 Key: MAPREDUCE-5043
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5043
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mr-am
    Affects Versions: 0.23.7, 2.0.4-beta
            Reporter: Jason Lowe
            Assignee: Jason Lowe
            Priority: Blocker


Saw an MRAppMaster with a 3G heap hit an OOM.  While investigating another 
instance that was still running, we saw the UI in an inconsistent state: the 
task table and task attempt tables in the job overview page didn't agree.  The 
AM log showed 
the AsyncDispatcher had hundreds of thousands of events in the event queue, and 
jstacks showed it spending a lot of time in fetch failure processing.  It turns 
out fetch failure processing is currently *very* expensive, with a triple 
{{for}} loop where the inner loop is calling the quite-expensive 
{{TaskAttempt.getReport}}.  That function ends up type-converting the entire 
task report, counters and all, and performing locale conversions among other 
things.  It does this for every reduce task in the job, for every map task that 
failed.  And when it's done building up the large task report, it pulls out one 
field, the phase, then throws the report away.
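
To make the cost pattern concrete, here is a minimal sketch of the shape described above. Every type and method in it ({{Attempt}}, {{Report}}, {{scanWithReports}}, {{scanWithPhase}}) is a hypothetical stand-in, not the actual MR AM code. What the real loop does with the phase is not stated in this report, so counting attempts in the shuffle phase is an assumption made only for illustration. The sketch contrasts building a full report per attempt with reading the phase directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these are the real Hadoop classes.
class FetchFailureSketch {
    enum Phase { SHUFFLE, SORT, REDUCE }

    // Counts how many full reports were built, to make the cost visible.
    static long reportsBuilt = 0;

    /** Heavyweight report: copying every counter is the costly part. */
    static final class Report {
        final Phase phase;
        final Map<String, Long> counters;

        Report(Phase phase, Map<String, Long> counters) {
            this.phase = phase;
            this.counters = new HashMap<>(counters); // full copy on every call
        }
    }

    static final class Attempt {
        private final Phase phase;
        private final Map<String, Long> counters = new HashMap<>();

        Attempt(Phase phase, int counterCount) {
            this.phase = phase;
            for (int i = 0; i < counterCount; i++) {
                counters.put("counter" + i, (long) i);
            }
        }

        /** Expensive path: builds the whole report, counters and all. */
        Report getReport() {
            reportsBuilt++;
            return new Report(phase, counters);
        }

        /** Cheap path (assumed accessor): returns only the field the caller needs. */
        Phase getPhase() {
            return phase;
        }
    }

    /** Triple loop that builds a full report just to read one field. */
    static long scanWithReports(int failedMaps, List<List<Attempt>> reduceTasks) {
        long shuffling = 0;
        for (int m = 0; m < failedMaps; m++) {
            for (List<Attempt> task : reduceTasks) {
                for (Attempt attempt : task) {
                    if (attempt.getReport().phase == Phase.SHUFFLE) {
                        shuffling++;
                    }
                }
            }
        }
        return shuffling;
    }

    /** Same traversal, but reads the phase directly and builds no reports. */
    static long scanWithPhase(int failedMaps, List<List<Attempt>> reduceTasks) {
        long shuffling = 0;
        for (int m = 0; m < failedMaps; m++) {
            for (List<Attempt> task : reduceTasks) {
                for (Attempt attempt : task) {
                    if (attempt.getPhase() == Phase.SHUFFLE) {
                        shuffling++;
                    }
                }
            }
        }
        return shuffling;
    }

    /** Builds taskCount reduce tasks, each with one SHUFFLE and one REDUCE attempt. */
    static List<List<Attempt>> buildTasks(int taskCount, int counterCount) {
        List<List<Attempt>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            List<Attempt> attempts = new ArrayList<>();
            attempts.add(new Attempt(Phase.SHUFFLE, counterCount));
            attempts.add(new Attempt(Phase.REDUCE, counterCount));
            tasks.add(attempts);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<List<Attempt>> tasks = buildTasks(1000, 50);
        long viaReports = scanWithReports(100, tasks);
        long built = reportsBuilt;
        long viaPhase = scanWithPhase(100, tasks);
        System.out.println("same answer: " + (viaReports == viaPhase));
        System.out.println("reports built by expensive path: " + built);
    }
}
```

Both scans return the same count, but only the first one allocates and copies a report for every (failed map, reduce task, attempt) combination.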

While the AM is busy processing fetch failures, task attempts continue to send 
events to the AM, including memory-expensive events such as status updates, 
which include the counters.  These back up in the AsyncDispatcher event queue, 
and eventually even an AM with a large heap will run out of memory and crash, 
or expire because it thrashes in garbage collection.
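
The backlog effect can be sketched with a toy, single-threaded simulation. This is not the AsyncDispatcher itself: {{EventBacklogSketch}}, {{backlogAfter}}, the per-tick rates, and the 16-byte payloads are all hypothetical, and the queue is assumed to be unbounded. The point is only that when events arrive faster than the handler drains them, queue depth (and the heap it pins) grows without limit.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model only; not the real AsyncDispatcher.
class EventBacklogSketch {

    /**
     * Simulates an unbounded event queue for the given number of ticks.
     * Each tick, arrivalsPerTick events are enqueued and at most
     * handledPerTick events are dequeued. Returns the final queue depth.
     */
    static int backlogAfter(int ticks, int arrivalsPerTick, int handledPerTick) {
        Queue<byte[]> queue = new ArrayDeque<>(); // assumed unbounded queue
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < arrivalsPerTick; i++) {
                queue.add(new byte[16]); // stand-in for an event payload
            }
            for (int i = 0; i < handledPerTick && !queue.isEmpty(); i++) {
                queue.poll();
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        // Slow handler: backlog grows by (arrivals - handled) every tick.
        System.out.println("slow handler backlog: " + backlogAfter(1000, 5, 1));
        // Handler keeping pace: the queue stays drained.
        System.out.println("fast handler backlog: " + backlogAfter(1000, 5, 5));
    }
}
```

In this model the only ways to keep the queue from growing are to make each event cheaper to handle or to bound what gets enqueued.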
