Jason Lowe created MAPREDUCE-4946:
-------------------------------------
Summary: Type conversion of map completion events leads to
performance problems with large jobs
Key: MAPREDUCE-4946
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4946
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 0.23.5, 2.0.2-alpha
Reporter: Jason Lowe
Priority: Critical
We've seen issues with large jobs (e.g.: 13,000 maps and 3,500 reduces) where
reducers fail to connect back to the AM after being launched due to connection
timeout. Looking at stack traces of the AM during this time we see a lot of
IPC servers stuck waiting for a lock to get the application ID while type
converting the map completion events. What's odd is that normally getting the
application ID should be very cheap, but in this case we're type-converting
thousands of map completion events for *each* reducer connecting. That means
we end up type-converting the map completion events over 45 million times
during the lifetime of the example job (13,000 * 3,500).
We either need to make the type conversion much cheaper (i.e.: lockless or at
least read-write locked) or, even better, store the completion events in a form
that does not require type conversion when serving them up to reducers.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira