[
https://issues.apache.org/jira/browse/HADOOP-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572801#action_12572801
]
Owen O'Malley commented on HADOOP-2559:
---------------------------------------
I think you missed the point of his experiment. He ran only 20 maps on a
200-node cluster to try to generate an unbalanced distribution of blocks.
Because word count is largely a scanning operation, you can see the better
distribution of the blocks leading to improved times. Naturally, this will get
better once HADOOP-1985 is committed.
The most relevant missing piece of information is the distribution of
blocks in the input directory, both per node and per rack. With trunk or
patch 1, he'd end up with those 20 nodes each having 5% of the blocks. (His 20
nodes probably hit 80% of the 23 racks in the 900-node HOD cluster he was
running on, so it makes sense that trunk and patch 1 aren't that far apart.)
Patch 2 should have generated a pretty even distribution across the nodes and
racks (although the 20% non-local racks would probably have 33% fewer blocks
than the local racks).
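As a rough illustration of the per-node and per-rack distribution mentioned above, here is a minimal sketch that tallies replica shares from a list of replica locations. The input format (block id, node, rack tuples) and the function name block_distribution are assumptions made for this example; they are not taken from the patches or from any Hadoop API.

```python
from collections import Counter

def block_distribution(replicas):
    """Return (node_share, rack_share) for an iterable of
    (block_id, node, rack) tuples. The tuple layout is hypothetical."""
    per_node = Counter()
    per_rack = Counter()
    for _block_id, node, rack in replicas:
        per_node[node] += 1
        per_rack[rack] += 1
    total = sum(per_node.values())
    # Convert raw replica counts into fractions of all replicas.
    node_share = {node: count / total for node, count in per_node.items()}
    rack_share = {rack: count / total for rack, count in per_rack.items()}
    return node_share, rack_share

if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    sample = [
        ("blk_1", "node1", "rackA"),
        ("blk_1", "node2", "rackA"),
        ("blk_1", "node3", "rackB"),
        ("blk_2", "node1", "rackA"),
    ]
    node_share, rack_share = block_distribution(sample)
    print(node_share)
    print(rack_share)
```

Comparing these shares for trunk, patch 1, and patch 2 on the same input directory would show directly how skewed each placement is.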
> DFS should place one replica per rack
> -------------------------------------
>
> Key: HADOOP-2559
> URL: https://issues.apache.org/jira/browse/HADOOP-2559
> Project: Hadoop Core
> Issue Type: Improvement
> Components: dfs
> Reporter: Runping Qi
> Assignee: lohit vijayarenu
> Attachments: HADOOP-2559-1.patch, HADOOP-2559-2.patch
>
>
> Currently, when writing out a block, DFS places one copy on the local data
> node, one copy on a rack-local node, and another on a remote node. This leads
> to a number of undesirable properties:
> 1. The block will be rack-local to two racks instead of three, reducing the
> advantage of rack-locality-based scheduling by 1/3.
> 2. The blocks of a file (especially a large file) are unevenly distributed
> over the nodes: one third will be on the local node, and two thirds on
> nodes on the same rack. This may make some nodes fill up much faster than
> others, increasing the need for rebalancing. Furthermore, it also makes some
> nodes "hot spots" if those big files are popular and accessed by many
> applications.
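The difference between the two placements described in this issue can be sketched with a small simulation. This is only an illustration under stated assumptions: a hypothetical 5-rack, 10-nodes-per-rack topology, a single writer, and function names (place_current, place_one_per_rack, local_rack_share) invented for this example. It is not the DFS implementation and not the logic of either attached patch.

```python
import random
from collections import Counter

def make_cluster(num_racks, nodes_per_rack):
    """Hypothetical topology: rack id -> list of node names."""
    return {r: [f"r{r}n{n}" for n in range(nodes_per_rack)]
            for r in range(num_racks)}

def place_current(cluster, writer_rack, writer_node, rng):
    """Placement as described in the issue: local node, another node on
    the same rack, and one node on a different rack."""
    same_rack_node = rng.choice(
        [n for n in cluster[writer_rack] if n != writer_node])
    remote_rack = rng.choice([r for r in cluster if r != writer_rack])
    remote_node = rng.choice(cluster[remote_rack])
    return [(writer_rack, writer_node),
            (writer_rack, same_rack_node),
            (remote_rack, remote_node)]

def place_one_per_rack(cluster, writer_rack, writer_node, rng):
    """Proposed placement: local node plus one node on each of two
    distinct other racks, so every replica is on a different rack."""
    other_racks = rng.sample([r for r in cluster if r != writer_rack], 2)
    return [(writer_rack, writer_node)] + [
        (rack, rng.choice(cluster[rack])) for rack in other_racks]

def local_rack_share(policy, num_blocks=1000, seed=0):
    """Fraction of all replicas that land on the writer's own rack."""
    rng = random.Random(seed)
    cluster = make_cluster(num_racks=5, nodes_per_rack=10)
    writer_rack, writer_node = 0, cluster[0][0]
    per_rack = Counter()
    for _ in range(num_blocks):
        for rack, _node in policy(cluster, writer_rack, writer_node, rng):
            per_rack[rack] += 1
    return per_rack[writer_rack] / sum(per_rack.values())

if __name__ == "__main__":
    print("current policy:", local_rack_share(place_current))
    print("one per rack:  ", local_rack_share(place_one_per_rack))
```

Under these assumptions, two thirds of the replicas end up on the writer's rack with the current placement and one third with one replica per rack, which matches the imbalance described in point 2.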