[ https://issues.apache.org/jira/browse/HADOOP-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709456#action_12709456 ]
Hong Tang commented on HADOOP-5830: ----------------------------------- This is probably complementary to Hadoop-2560 (btw, is the new CombineFileInputFormat capable of picking blocks from different files and lump them into one input split?). I consider this approach is better because it could achieve better load balancing, and lower overhead of map failure or speculative execution. > Reuse output collectors across maps running on the same jvm > ----------------------------------------------------------- > > Key: HADOOP-5830 > URL: https://issues.apache.org/jira/browse/HADOOP-5830 > Project: Hadoop Core > Issue Type: Improvement > Components: mapred > Reporter: Arun C Murthy > > We have evidence that cutting the shuffle-crossbar between maps and reduces > (m * r) leads to perfomant applications since: > # It cuts down the number of connections necessary to shuffle and hence > reduces load on the serving-side (TaskTracker) and improves latency > (terasort, HADOOP-1338, HADOOP-5223) > # Reduces seeks required for the TaskTracker to serve the map-outputs > So far we've had to manually tune applications to cut down the shuffle- > crossbar by having fatter maps with custom input formats etc. For e.g. we saw > a significant improvement while running the petasort when we went from > ~800,000 maps to 80,00 maps (1.5G to 15G per map) i.e. from 48+ hours to 16 > hours, > The downsides are: > # The burden falls on the application-writer to tune this with custom > input-formats etc. > # The naive method of using a higher min.split.size leads to considerable > non-local i/o on the maps. > Given these, the proposal is to keep the 'output collector' open across jvm > reuse for maps, there-by enabling 'combiners' across map-tasks. This would > have the happy-effect of fixing both the above. The downsides are that it > will add latency to jobs (since map-outputs cannot be shuffled till a few > maps on the same jvm are done, then followed by a final sort/merge/combine) > and the failure cases get a bit more complicated. > Thoughts? Lets discuss... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.