[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4730:
----------------------------------

    Status: Open  (was: Patch Available)

I think immediately turning around and asking for the next MAX_EVENTS maps if 
we just received MAX_EVENTS entries would be a straightforward way to eliminate 
the sleep penalty.  Unfortunately I don't think that will work all the time due 
to another bug where the caller can receive fewer than MAX_EVENTS entries even 
though that many entries were processed during the call.
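
To make the idea concrete, here is a minimal sketch of a polling loop that 
skips the sleep whenever a full batch comes back.  It is not the actual 
reducer fetch code: EventPollSketch, EventSource and drain are hypothetical 
names, MAX_EVENTS is set to a tiny value for illustration, and the loop stops 
on an empty batch only so that the sketch terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EventPollSketch {
    // Tiny ask size for illustration only.
    public static final int MAX_EVENTS = 3;

    /** Hypothetical stand-in for the AM: returns up to maxEvents events starting at fromEvent. */
    public interface EventSource {
        List<String> getEvents(int fromEvent, int maxEvents);
    }

    /**
     * Drains the source into out and returns how many times the caller
     * would have slept. A full batch triggers an immediate re-ask.
     */
    public static int drain(EventSource source, List<String> out) {
        int fromEvent = 0;
        int sleeps = 0;
        while (true) {
            List<String> batch = source.getEvents(fromEvent, MAX_EVENTS);
            out.addAll(batch);
            fromEvent += batch.size();
            if (batch.size() == MAX_EVENTS) {
                continue;          // full batch: more may be waiting, ask again now
            }
            if (batch.isEmpty()) {
                break;             // sketch only: treat an empty batch as "done"
            }
            sleeps++;              // short batch: a real caller would sleep here
        }
        return sleeps;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("m0", "m1", "m2", "m3", "m4", "m5", "m6");
        List<String> got = new ArrayList<>();
        int sleeps = drain(
            (from, max) -> all.subList(Math.min(from, all.size()),
                                       Math.min(from + max, all.size())),
            got);
        System.out.println("received=" + got.size() + " sleeps=" + sleeps);
    }
}
```

Note that the "full batch" test is only reliable if the callee really returns 
MAX_EVENTS entries whenever that many were available.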

TaskAttemptListenerImpl is calling Job.getTaskAttemptCompletionEvents with the 
same fromEvent and maxEvents passed in from the reducer but is then filtering 
the result for just map events.  This means that even though we receive 
maxEvents completion events, the caller could see fewer than that if there are 
one or more reducer completion events mixed in there.  Worse, if all of the 
events are reducer events then zero events will be reported back to the caller 
and it won't bump up fromEvent on the next call.  The reducer never makes 
progress and we're toast.  This could happen in a case where all maps 
complete, more than MAX_EVENTS reducers complete, but some straggling reducers 
get fetch failures and cause a map to be restarted.  This is less likely to 
occur with an ask size of 10000 since you'd have to have 10000 reducers 
complete in a row, but it's more likely with an ask size of 500.
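
The stuck-caller scenario can be reproduced with a small simulation.  This is 
only a sketch of the filter-after-window behavior described above, not the 
TaskAttemptListenerImpl code: FilterAfterWindowSketch, Event, getMapEvents, 
callerSeesMap and scenario are all hypothetical names, and the event list is 
made up (two maps complete, three reducers complete, then one map is re-run).

```java
import java.util.ArrayList;
import java.util.List;

public class FilterAfterWindowSketch {
    /** Hypothetical completion event: a map/reduce flag plus an id. */
    public static final class Event {
        public final boolean isMap;
        public final int id;
        public Event(boolean isMap, int id) { this.isMap = isMap; this.id = id; }
    }

    /** Takes the window [fromEvent, fromEvent + maxEvents) over ALL events, then keeps only maps. */
    public static List<Event> getMapEvents(List<Event> all, int fromEvent, int maxEvents) {
        int start = Math.min(fromEvent, all.size());
        int end = Math.min(start + maxEvents, all.size());
        List<Event> maps = new ArrayList<>();
        for (Event e : all.subList(start, end)) {
            if (e.isMap) {
                maps.add(e);
            }
        }
        return maps;
    }

    /** Caller advances fromEvent by the number of events it received, as described above. */
    public static boolean callerSeesMap(List<Event> all, int maxEvents, int targetId, int maxCalls) {
        int fromEvent = 0;
        for (int call = 0; call < maxCalls; call++) {
            List<Event> batch = getMapEvents(all, fromEvent, maxEvents);
            for (Event e : batch) {
                if (e.id == targetId) {
                    return true;
                }
            }
            fromEvent += batch.size();   // an empty batch leaves fromEvent unchanged
        }
        return false;
    }

    /** Made-up history: maps 0 and 1 finish, reducers 100-102 finish, map 2 is a re-run. */
    public static List<Event> scenario() {
        List<Event> all = new ArrayList<>();
        all.add(new Event(true, 0));
        all.add(new Event(true, 1));
        all.add(new Event(false, 100));
        all.add(new Event(false, 101));
        all.add(new Event(false, 102));
        all.add(new Event(true, 2));
        return all;
    }

    public static void main(String[] args) {
        List<Event> all = scenario();
        System.out.println("ask size 2 sees re-run map: " + callerSeesMap(all, 2, 2, 100));
        System.out.println("ask size 4 sees re-run map: " + callerSeesMap(all, 4, 2, 100));
    }
}
```

With an ask size of 2 the second window contains only reducer events, so the 
caller gets an empty batch, never advances fromEvent, and never learns about 
the re-run map.  With an ask size of 4 the window is wide enough to reach past 
the run of reducer events.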
                
> AM crashes due to OOM while serving up map task completion events
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4730
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4730.patch
>
>
> We're seeing a repeatable OOM crash in the AM for a task with around 30000 
> maps and 3000 reducers.  Details to follow.

