[ 
https://issues.apache.org/jira/browse/HADOOP-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572801#action_12572801
 ] 

Owen O'Malley commented on HADOOP-2559:
---------------------------------------

I think you missed the point of his experiment. He ran only 20 maps on a 
200-node cluster to try to generate an unbalanced distribution of blocks. 
Because word count is largely a scanning operation, you can see the better 
distribution of the blocks leading to improved times. Naturally, this will be 
better once HADOOP-1985 is committed.

The most relevant missing piece of information would be the distribution of 
blocks in the input directory, both per node and per rack. With trunk or 
patch 1, he'd end up with those 20 nodes each having 5% of the blocks. (His 20 
nodes probably hit 80% of the 23 racks in the 900-node hod cluster he was 
running on, so it makes sense that trunk and patch 1 aren't that far apart.) 
Patch 2 should have generated a fairly even distribution across the nodes and 
racks (although the 20% non-local racks would probably have 33% fewer blocks 
than the local racks).

> DFS should place one replica per rack
> -------------------------------------
>
>                 Key: HADOOP-2559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2559
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Runping Qi
>            Assignee: lohit vijayarenu
>         Attachments: HADOOP-2559-1.patch, HADOOP-2559-2.patch
>
>
> Currently, when writing out a block, dfs will place one copy on a local data 
> node, one copy on a rack-local node, and another one on a remote node. This 
> leads to a number of undesired properties:
> 1. The block will be rack-local to two racks instead of three, reducing the 
> advantage of rack-locality-based scheduling by 1/3.
> 2. The blocks of a file (especially a large file) are unevenly distributed 
> over the nodes: one third will be on the local node, and two thirds on the 
> nodes on the same rack. This may make some nodes fill up much faster than 
> others, increasing the need for rebalancing. Furthermore, this also makes 
> some nodes become "hot spots" if those big files are popular and accessed by 
> many applications.
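
As a rough illustration of point 1 in the description above, the two placement 
policies can be compared with a small sketch. This is not the code from either 
attached patch. The cluster layout, the function names (make_cluster, 
place_current, place_one_per_rack, racks_used) and the assumption that the 
first replica always stays on the writer's node are all hypothetical and used 
only for illustration.

```python
import random


def make_cluster(num_racks, nodes_per_rack):
    """Hypothetical cluster: maps a rack id to the list of node names on it."""
    return {r: [f"r{r}n{i}" for i in range(nodes_per_rack)]
            for r in range(num_racks)}


def place_current(cluster, writer_rack, writer_node, rng):
    """Policy as described in the issue: local node, another node on the
    same rack, and one node on a different rack."""
    same_rack_node = rng.choice(
        [n for n in cluster[writer_rack] if n != writer_node])
    remote_rack = rng.choice([r for r in cluster if r != writer_rack])
    remote_node = rng.choice(cluster[remote_rack])
    return [(writer_rack, writer_node),
            (writer_rack, same_rack_node),
            (remote_rack, remote_node)]


def place_one_per_rack(cluster, writer_rack, writer_node, rng):
    """Assumed reading of the proposal: keep the local replica, then put the
    other two replicas on two distinct remote racks."""
    other_racks = rng.sample([r for r in cluster if r != writer_rack], 2)
    return [(writer_rack, writer_node)] + [
        (r, rng.choice(cluster[r])) for r in other_racks]


def racks_used(replicas):
    """Number of distinct racks that hold a replica of the block."""
    return len({rack for rack, _node in replicas})
```

Under these assumptions, racks_used() is always 2 for place_current and always 
3 for place_one_per_rack, and the writer's rack holds two of the three 
replicas in the first case but only one in the second.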

