[
https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478496#comment-13478496
]
Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------
Here's what I have gathered so far from a heap dump of an AM attempt that was
just about to run out of memory. Most of the memory was consumed by byte
buffers, and most of those appeared to be RPC response buffers. I think a
flow control issue in the IPC layer might have led to this.
More than half of the mappers finished before the first reducer started, and
thousands of reducers all launched within a few seconds of each other. They
all asked the AM for map task completion events, and the AM currently caps each
response at 10000 events per query. Since more than 10000 maps completed
before the first reducers started, each reducer saw a full event list which
took around 900K for each response buffer. There were many IPC Handler threads
to service the calls, but only one Responder thread to send out the rather
large response buffers. I see there's a blocking queue to prevent too many
calls from coming in at once, but I didn't see any flow control between the
Handlers and the Responder thread. It appears that as long as the Handler
threads can keep the call queue relatively short, they can continue to queue
up call response data faster than the Responder thread can send it out.
Eventually this will exhaust available memory leading to an OOM.
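
To make the missing back-pressure concrete, here is a minimal sketch of one
way Handlers could be throttled by the number of response bytes still waiting
on the Responder. This is not the Hadoop IPC code and not a proposed patch:
the class ResponseBudget, its methods, and the byte limits are all
hypothetical, and "900K" is assumed to mean 900 * 1024 bytes.

```java
import java.util.concurrent.Semaphore;

/**
 * Hypothetical byte budget shared by Handler threads and a Responder thread.
 * A Handler reserves bytes before queuing a response; the Responder returns
 * them once the response has been written to the socket.
 */
public class ResponseBudget {
    private final int maxBytes;
    private final Semaphore permits;

    public ResponseBudget(int maxBytes) {
        this.maxBytes = maxBytes;
        this.permits = new Semaphore(maxBytes);
    }

    // Clamp so a single response larger than the whole budget cannot block forever.
    private int clamp(int responseBytes) {
        return Math.min(responseBytes, maxBytes);
    }

    /** Handler side: block until there is room for this response. */
    public void acquire(int responseBytes) throws InterruptedException {
        permits.acquire(clamp(responseBytes));
    }

    /** Handler side, non-blocking variant: false means the budget is exhausted. */
    public boolean tryAcquire(int responseBytes) {
        return permits.tryAcquire(clamp(responseBytes));
    }

    /** Responder side: call after the response buffer has been sent. */
    public void release(int responseBytes) {
        permits.release(clamp(responseBytes));
    }

    public int availableBytes() {
        return permits.availablePermits();
    }

    /** Upper bound on queued response data if every reducer has one full response pending. */
    public static long worstCaseBacklogBytes(int reducers, long bytesPerResponse) {
        return (long) reducers * bytesPerResponse;
    }
}
```

With the numbers from this job, worstCaseBacklogBytes(3000, 900 * 1024) comes
to 2,764,800,000 bytes, so a single round of full responses from 3000 reducers
could need roughly 2.7 GB if nothing slows the Handlers down. A budget like
the one sketched here trades that memory growth for Handlers that wait on the
Responder.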
> AM crashes due to OOM while serving up map task completion events
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-4730
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3
> Reporter: Jason Lowe
> Priority: Blocker
>
> We're seeing a repeatable OOM crash in the AM for a job with around 30000
> maps and 3000 reducers. Details to follow.