Jason Lowe created MAPREDUCE-4946:
-------------------------------------

             Summary: Type conversion of map completion events leads to 
performance problems with large jobs
                 Key: MAPREDUCE-4946
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4946
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mr-am
    Affects Versions: 0.23.5, 2.0.2-alpha
            Reporter: Jason Lowe
            Priority: Critical


We've seen issues with large jobs (e.g., 13,000 maps and 3,500 reduces) where 
reducers fail to connect back to the AM after being launched because the 
connection times out.  Stack traces of the AM taken during this time show many 
IPC server threads stuck waiting for a lock to get the application ID while 
type-converting the map completion events.  Getting the application ID should 
normally be very cheap, but in this case we're type-converting thousands of map 
completion events for *each* reducer that connects.  That means we end up 
type-converting map completion events over 45 million times during the 
lifetime of the example job (13,000 * 3,500 = 45,500,000).
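
To make the arithmetic concrete, here is a minimal sketch of the cost model 
described above. It assumes each reducer ends up converting every map 
completion event exactly once; the class and method names are hypothetical 
and are not taken from the Hadoop code base.

```java
public class ConversionCost {
    /** Conversions if every reducer converts every map completion event once. */
    static long perRequestConversions(long maps, long reduces) {
        return maps * reduces;
    }

    /** Brute-force count of the same model, only sensible for small inputs. */
    static long simulate(int maps, int reduces) {
        long conversions = 0;
        for (int r = 0; r < reduces; r++) {
            for (int m = 0; m < maps; m++) {
                conversions++; // one type conversion per event per reducer
            }
        }
        return conversions;
    }

    public static void main(String[] args) {
        // Example job from this report: 13,000 maps and 3,500 reduces.
        System.out.println(perRequestConversions(13_000, 3_500)); // 45500000
    }
}
```

If reducers re-fetch events they have already seen, the real count would be 
higher than this model suggests.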

We either need to make the type conversion much cheaper (i.e., lockless or at 
least read-write locked) or, even better, store the completion events in a form 
that does not require type conversion when serving them up to reducers.
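
As a rough illustration of the second option, the sketch below converts each 
event once, when it is recorded, and serves reducers from the already 
converted list under a read lock. This is only a sketch under stated 
assumptions, not a proposed patch: InternalEvent, WireEvent, and 
CompletionEventStore are hypothetical stand-ins rather than actual Hadoop 
classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CompletionEventStore {
    /** Hypothetical internal representation of a map completion event. */
    static final class InternalEvent {
        final int mapId;
        InternalEvent(int mapId) { this.mapId = mapId; }
    }

    /** Hypothetical form that can be handed to reducers as-is. */
    static final class WireEvent {
        final int mapId;
        WireEvent(int mapId) { this.mapId = mapId; }
    }

    private final List<WireEvent> wireEvents = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private long conversions = 0;

    /** Convert once at record time, outside the lock, then append. */
    void add(InternalEvent event) {
        WireEvent converted = new WireEvent(event.mapId);
        lock.writeLock().lock();
        try {
            wireEvents.add(converted);
            conversions++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Serve a slice without converting; assumes fromIndex >= 0 and maxEvents >= 0. */
    List<WireEvent> get(int fromIndex, int maxEvents) {
        lock.readLock().lock();
        try {
            int size = wireEvents.size();
            if (fromIndex >= size) {
                return Collections.emptyList();
            }
            // Use long arithmetic so a very large maxEvents cannot overflow.
            int to = (int) Math.min((long) size, (long) fromIndex + maxEvents);
            return new ArrayList<>(wireEvents.subList(fromIndex, to));
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Number of conversions performed so far. */
    long conversionCount() {
        lock.readLock().lock();
        try {
            return conversions;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        CompletionEventStore store = new CompletionEventStore();
        for (int i = 0; i < 5; i++) {
            store.add(new InternalEvent(i));
        }
        for (int reducer = 0; reducer < 1000; reducer++) {
            store.get(0, 100); // reads do not trigger any conversion
        }
        System.out.println(store.conversionCount()); // 5
    }
}
```

With this layout the number of conversions grows with the number of maps only, 
and concurrent readers share the read lock instead of serializing on a single 
monitor.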
