[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated MAPREDUCE-4892:
-----------------------------------------------

    Status: Open  (was: Patch Available)

I can think of a case where this breaks locality:

Nodes: N1, N2, N3
Block allocation: 9 on N1, 1 on N2, 8 on N3
Rack: default
Num Blocks needed: 9 (numMaps)
Blocks(N1) = Blocks(N2) union Blocks(N3)

Average # blocks per node determined by patch = 3 (9 numMaps / 3 nodes)

Assignment before patch: 9 splits on N1. (suboptimal spread, but fully local)
Final assignment after patch: 3 on N1, 1 on N2, 3 on N3 and the final 2 on rack.

Assuming my analysis is correct: I think we should first note the maximum possible number of local splits (9 here), then give one split per node, circling through all nodes, and keep doing so until all of those possible local splits have been created. This loop will end up being similar to the rack loop that we have after the node-local assignments.

> CombineFileInputFormat node input split can be skewed on small clusters
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4892
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>             Fix For: 3.0.0
>
>         Attachments: MAPREDUCE-4892.1.patch
>
>
> The CombineFileInputFormat split generation logic tries to group blocks by
> node in order to create splits. It iterates through the nodes and creates
> splits on them until there aren't enough blocks left on a node that can be
> grouped into a valid split. If the first few nodes have a lot of blocks on
> them then they can end up getting a disproportionately large share of the
> total number of splits created. This can result in poor locality of maps.
> This problem is likely to happen on small clusters where it's easier to create
> a skew in the distribution of blocks on nodes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
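
The N1/N2/N3 scenario in the comment above can be checked with a small simulation. What follows is only a sketch under stated assumptions, not the Hadoop code or the attached patch: every split is a single block, blocks are plain integers, the one block on N2 is taken to be the last block (so that a first-fit pass reproduces the 3/1/3/2 outcome from the comment), and the function names capped_assignment and round_robin_assignment are hypothetical.

```python
def capped_assignment(node_blocks, cap):
    """Visit nodes in order; each takes at most `cap` of its unassigned local blocks.

    Simplified model of the per-node average limit described in the comment.
    """
    assigned = set()
    per_node = {}
    for node, blocks in node_blocks.items():
        take = [b for b in blocks if b not in assigned][:cap]
        assigned.update(take)
        per_node[node] = take
    # Anything still unassigned would fall through to the rack-level loop.
    leftover = sorted(set().union(*node_blocks.values()) - assigned)
    return per_node, leftover


def round_robin_assignment(node_blocks):
    """Give one local block per node per pass until no node can take another."""
    assigned = set()
    per_node = {node: [] for node in node_blocks}
    progress = True
    while progress:
        progress = False
        for node, blocks in node_blocks.items():
            nxt = next((b for b in blocks if b not in assigned), None)
            if nxt is not None:
                assigned.add(nxt)
                per_node[node].append(nxt)
                progress = True
    leftover = sorted(set().union(*node_blocks.values()) - assigned)
    return per_node, leftover


# Assumed layout: 9 blocks on N1, 1 on N2, 8 on N3, Blocks(N1) = Blocks(N2) union Blocks(N3).
node_blocks = {"N1": list(range(9)), "N2": [8], "N3": list(range(8))}
cap = 9 // len(node_blocks)  # 9 numMaps / 3 nodes = 3

capped, capped_rack = capped_assignment(node_blocks, cap)
rr, rr_rack = round_robin_assignment(node_blocks)

print({n: len(b) for n, b in capped.items()}, "rack:", len(capped_rack))
# {'N1': 3, 'N2': 1, 'N3': 3} rack: 2
print({n: len(b) for n, b in rr.items()}, "rack:", len(rr_rack))
# {'N1': 4, 'N2': 1, 'N3': 4} rack: 0
```

In this toy model the capped pass leaves two blocks for the rack loop even though both are stored on N1 and N3, while the round-robin pass keeps all 9 splits node-local and still spreads them across nodes. The real split logic also has to respect the minimum and maximum split sizes, which this sketch ignores.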