[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4730:
----------------------------------
Status: Open (was: Patch Available)
I think immediately turning around and asking for the next MAX_EVENTS maps if
we just received MAX_EVENTS entries would be a straightforward way to eliminate
the sleep penalty. Unfortunately I don't think that will work all the time, due
to another bug where the caller can receive fewer than MAX_EVENTS entries even
though that many entries were processed during the call.
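
To put a rough number on the sleep penalty, here is a toy sketch of the two
polling policies. It is not the actual fetcher code: the class and method names
are made up for illustration, and it assumes every call returns a full window
of events whenever that many are available, which is exactly the assumption the
bug described next breaks.

```java
public class EagerEventFetch {
    /**
     * Counts how many sleeps a poller incurs while draining totalEvents.
     * eager == true:  sleep only after a short batch (fewer than maxEvents).
     * eager == false: sleep after every batch.
     * Hypothetical model; assumes maxEvents > 0 and full windows when available.
     */
    static int countSleeps(int totalEvents, int maxEvents, boolean eager) {
        int fromEvent = 0;
        int sleeps = 0;
        while (fromEvent < totalEvents) {
            // Number of events delivered by this call.
            int got = Math.min(maxEvents, totalEvents - fromEvent);
            fromEvent += got;
            // A full batch suggests more events are waiting, so the eager
            // policy asks again right away instead of sleeping.
            if (!eager || got < maxEvents) {
                sleeps++;
            }
        }
        return sleeps;
    }

    public static void main(String[] args) {
        System.out.println(countSleeps(30000, 500, false)); // prints 60
        System.out.println(countSleeps(30000, 500, true));  // prints 0
    }
}
```

With 30000 map events and an ask size of 500, the sleep-every-time policy pays
for 60 sleeps while the eager policy pays for none in this model.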
TaskAttemptListenerImpl is calling Job.getTaskAttemptCompletionEvents with the
same fromEvent and maxEvents passed in from the reducer but is then filtering
the result for just map events. This means that even though we receive
maxEvents completion events, the caller could see fewer than that if there are
one or more reducer completion events mixed in there. Worse, if all of the
events are reducer events then zero events will be reported back to the caller
and it won't bump up fromEvent on the next call. The reducer never makes
progress and we're toast. This could happen in a case where all maps complete,
more than MAX_EVENTS reducers complete, but some straggling reducers get fetch
failures and cause a map to be restarted. This is less likely to occur with an
ask size of 10000 since you'd have to have 10000 reducers complete in a row,
but it's more likely with an ask size of 500.
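
The stall can be reproduced with a small stand-alone simulation. This is only a
sketch of the behavior described above, not the real Hadoop code: the class,
the boolean-array event list, and both method names are hypothetical, and it
assumes the reducer advances fromEvent by the number of events it gets back.

```java
import java.util.ArrayList;
import java.util.List;

public class CompletionEventStall {
    /**
     * Hypothetical stand-in for the filtered fetch: look at the window
     * [fromEvent, fromEvent + maxEvents) and return only the map events
     * (represented here by their indices).
     */
    static List<Integer> fetchMapEvents(boolean[] isMap, int fromEvent, int maxEvents) {
        List<Integer> maps = new ArrayList<>();
        int end = Math.min(isMap.length, fromEvent + maxEvents);
        for (int i = fromEvent; i < end; i++) {
            if (isMap[i]) {
                maps.add(i);
            }
        }
        return maps;
    }

    /**
     * Assumed reducer-side polling: bump fromEvent by the number of events
     * returned. Returns the final fromEvent after the given number of polls.
     */
    static int pollReducer(boolean[] isMap, int maxEvents, int polls) {
        int fromEvent = 0;
        for (int p = 0; p < polls; p++) {
            fromEvent += fetchMapEvents(isMap, fromEvent, maxEvents).size();
        }
        return fromEvent;
    }

    public static void main(String[] args) {
        // Three maps complete, four reducers complete, then one map is re-run.
        boolean[] isMap = {true, true, true, false, false, false, false, true};
        // Ask size of 3: the second window holds only reducer events.
        System.out.println(pollReducer(isMap, 3, 10)); // prints 3
    }
}
```

With an ask size of 3 the second window contains nothing but reducer events, so
the filtered result is empty, fromEvent stays at 3 no matter how many times the
reducer polls, and the re-run map's event at index 7 is never delivered.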
> AM crashes due to OOM while serving up map task completion events
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-4730
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Blocker
> Attachments: MAPREDUCE-4730.patch
>
>
> We're seeing a repeatable OOM crash in the AM for a task with around 30000
> maps and 3000 reducers. Details to follow.