[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725893#comment-13725893
 ] 

Bikas Saha commented on MAPREDUCE-5352:
---------------------------------------

blockToNodes doesn't look like it needs to be a map?

What are the results of running the new test with the old code? From what I 
see, the test has a uniform distribution of blocks, so the old code should pass 
the test too. The test by itself is a good test to have. Distribution fixes 
like the one in this patch are not easy to test anyway.

The code change looks correct overall. 

It would be great if we could ascertain how the performance of the new algorithm 
compares to the earlier one, e.g. how much time it takes to create splits for 1 
million blocks on 10000 machines with 4 blocks per split. I expect that looping 
once for every split will be slower, though I am not sure by how much.
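
One rough way to get a first number for the measurement suggested above is a
standalone timing harness over a synthetic block layout. The sketch below is
only that: it does not call CombineFileInputFormat, and the class name, the
method names, the replication factor of 3, and the random placement of replicas
are all assumptions made for illustration. It times a simplified
"one split per node per pass" loop for 1 million blocks on 10000 nodes with 4
blocks per split, so the result says nothing definitive about the real patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SplitTimingSketch {

    // Hypothetical synthetic layout: every block gets `replication` randomly
    // chosen replica nodes. Returns, for each node, the blocks stored on it.
    static List<List<Integer>> randomLayout(int numBlocks, int numNodes,
                                            int replication, long seed) {
        Random rnd = new Random(seed);
        List<List<Integer>> nodeToBlocks = new ArrayList<>();
        for (int n = 0; n < numNodes; n++) {
            nodeToBlocks.add(new ArrayList<>());
        }
        for (int b = 0; b < numBlocks; b++) {
            for (int r = 0; r < replication; r++) {
                List<Integer> local = nodeToBlocks.get(rnd.nextInt(numNodes));
                // Skip a repeated pick of the same node for the same block.
                if (local.isEmpty() || local.get(local.size() - 1) != b) {
                    local.add(b);
                }
            }
        }
        return nodeToBlocks;
    }

    // Simplified model of the proposed approach: in every pass each node may
    // form at most one split out of its still-unassigned local blocks.
    // Assumes no block id appears twice in a single node's list.
    // Returns the number of node-local splits created.
    static int onePerNodePerPass(List<List<Integer>> nodeToBlocks,
                                 int numBlocks, int blocksPerSplit) {
        boolean[] assigned = new boolean[numBlocks];
        int[] cursor = new int[nodeToBlocks.size()]; // scan position per node
        int[] picked = new int[blocksPerSplit];
        int splits = 0;
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int n = 0; n < nodeToBlocks.size(); n++) {
                List<Integer> local = nodeToBlocks.get(n);
                int count = 0;
                int i = cursor[n];
                while (i < local.size() && count < blocksPerSplit) {
                    int b = local.get(i++);
                    if (!assigned[b]) {
                        picked[count++] = b;
                    }
                }
                if (count == blocksPerSplit) {
                    for (int k = 0; k < blocksPerSplit; k++) {
                        assigned[picked[k]] = true;
                    }
                    cursor[n] = i; // everything before i is now assigned
                    splits++;
                    progress = true;
                } else {
                    // Not enough free blocks left; this node is exhausted.
                    cursor[n] = local.size();
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        int numBlocks = 1_000_000;
        int numNodes = 10_000;
        int replication = 3;      // assumption, not stated in the comment
        int blocksPerSplit = 4;
        List<List<Integer>> layout =
            randomLayout(numBlocks, numNodes, replication, 42L);
        long start = System.nanoTime();
        int splits = onePerNodePerPass(layout, numBlocks, blocksPerSplit);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(splits + " node-local splits in " + elapsedMs + " ms");
    }
}
```

A real comparison would time getSplits() of the old and the patched
CombineFileInputFormat against the same layout; the harness above only shows
the shape such a measurement could take.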
                
> Optimize node local splits generated by CombineFileInputFormat 
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-5352
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5352
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.0.5-alpha
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: MAPREDUCE-5352.1.txt, MAPREDUCE-5352.2.txt, 
> MAPREDUCE-5352.3.txt, MAPREDUCE-5352.4.txt
>
>
> CombineFileInputFormat currently walks through all available nodes and 
> generates multiple (maxSplitsPerNode) splits on a single node before 
> attempting to generate splits on subsequent nodes. This reduces the 
> possibility of generating splits for subsequent nodes, since the blocks 
> already consumed are no longer available to them. Allowing splits to go 1 
> block above the max-split-size makes this worse.
> Allocating a single split per node per iteration should help increase the 
> distribution of splits across nodes, so that subsequent nodes have more 
> blocks to choose from.
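
The difference between the two strategies in the description above can be
illustrated with a small, self-contained model. This is a sketch under stated
assumptions, not the Hadoop implementation: the class and method names are
hypothetical, a split is modelled as a fixed number of blocks, and rack-level
handling of leftover blocks is left out. With two nodes that both hold
replicas of the same four blocks, the per-node greedy walk puts both splits on
the first node, while one split per node per pass gives each node one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NodeLocalSplitSketch {

    // Try to form one split from blocks local to a node that are not yet
    // assigned. Marks them assigned and returns true only for a full split.
    static boolean takeSplit(List<Integer> localBlocks, Set<Integer> assigned,
                             int blocksPerSplit) {
        List<Integer> picked = new ArrayList<>();
        for (int b : localBlocks) {
            if (!assigned.contains(b)) {
                picked.add(b);
                if (picked.size() == blocksPerSplit) {
                    break;
                }
            }
        }
        if (picked.size() < blocksPerSplit) {
            return false;
        }
        assigned.addAll(picked);
        return true;
    }

    // Model of the behaviour described as current: create up to
    // maxSplitsPerNode splits on a node before moving to the next node.
    static Map<String, Integer> perNodeGreedy(
            Map<String, List<Integer>> nodeToBlocks,
            int blocksPerSplit, int maxSplitsPerNode) {
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splits = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : nodeToBlocks.entrySet()) {
            int count = 0;
            while (count < maxSplitsPerNode
                    && takeSplit(e.getValue(), assigned, blocksPerSplit)) {
                count++;
            }
            splits.put(e.getKey(), count);
        }
        return splits;
    }

    // Model of the proposed behaviour: at most one split per node in each
    // pass over the nodes; stop when a full pass creates no split.
    static Map<String, Integer> onePerNodePerPass(
            Map<String, List<Integer>> nodeToBlocks, int blocksPerSplit) {
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splits = new LinkedHashMap<>();
        for (String node : nodeToBlocks.keySet()) {
            splits.put(node, 0);
        }
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Map.Entry<String, List<Integer>> e : nodeToBlocks.entrySet()) {
                if (takeSplit(e.getValue(), assigned, blocksPerSplit)) {
                    splits.merge(e.getKey(), 1, Integer::sum);
                    progress = true;
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Two nodes holding replicas of the same four blocks.
        Map<String, List<Integer>> nodeToBlocks = new LinkedHashMap<>();
        nodeToBlocks.put("n1", Arrays.asList(0, 1, 2, 3));
        nodeToBlocks.put("n2", Arrays.asList(0, 1, 2, 3));
        System.out.println(perNodeGreedy(nodeToBlocks, 2, 2));   // {n1=2, n2=0}
        System.out.println(onePerNodePerPass(nodeToBlocks, 2));  // {n1=1, n2=1}
    }
}
```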

