[ 
https://issues.apache.org/jira/browse/HDFS-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431772#comment-16431772
 ] 

Ajay Kumar commented on HDFS-13279:
-----------------------------------

{quote}HDFS-11998 sets DFSNetworkTopology as default topology implementation 
even though net.topology.impl is set to NetworkTopology. In HDFS-11530, once 
dfs.use.dfs.network.topology is true, the implementation is hard-coded to 
DFSNetworkTopology no matter what net.topology.impl is. So we have to modify 
the behavior if we need to add a new topology implementation and let it work. 
Maybe we could fix it in another Jira?{quote}
This is why I suggested modifying {{DatanodeManager#init}}; maybe we can 
make it generic enough to handle similar scenarios in the future. I think it's 
OK to address it in another Jira, but then we should not modify the current 
default behaviour, as the new implementation has not been tested at scale.

{quote}It is OK if we use the "first choose a rack, then choose a node" logic 
in chooseRandom. The purpose of choosing twice is mostly to reuse the current 
choosing logic, which makes the code simpler.{quote}
Sorry, I do not clearly understand your point on this one. I think choosing a 
node twice without considering weight increases the probability of the wrong 
node being selected the first time, which makes it a little costlier than 
choosing the rack first based on rack weight. 
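
To make the "choose the rack first, based on rack weight" idea concrete, here is a minimal sketch. It assumes a rack's weight is simply its node count, which may not be the weight used in the patch. The class and method names are hypothetical and are not taken from the patch or from any Hadoop API.

```java
import java.util.Random;

// Hypothetical illustration only; not code from HDFS-13279 or from Hadoop.
final class WeightedRackChoiceSketch {

    /** Picks a rack index with probability proportional to its node count (assumed weight). */
    static int chooseRack(int[] rackSizes, Random rnd) {
        int total = 0;
        for (int size : rackSizes) {
            total += size;
        }
        int ticket = rnd.nextInt(total); // uniform in [0, total)
        for (int rack = 0; rack < rackSizes.length; rack++) {
            if (ticket < rackSizes[rack]) {
                return rack;
            }
            ticket -= rackSizes[rack];
        }
        throw new IllegalStateException("unreachable for non-negative sizes with total > 0");
    }

    /** Returns {rackIndex, nodeIndexWithinRack}; a single weighted draw, no retry needed. */
    static int[] chooseNode(int[] rackSizes, Random rnd) {
        int rack = chooseRack(rackSizes, rnd);
        return new int[] {rack, rnd.nextInt(rackSizes[rack])};
    }
}
```

With rack sizes {15, 15, 15, 5}, the small rack is picked about 10% of the time under this assumption, so each node is equally likely overall and no second draw is needed to correct an unweighted first choice.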



> Datanodes usage is imbalanced if number of nodes per rack is not equal
> ----------------------------------------------------------------------
>
>                 Key: HDFS-13279
>                 URL: https://issues.apache.org/jira/browse/HDFS-13279
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Tao Jie
>            Assignee: Tao Jie
>            Priority: Major
>         Attachments: HDFS-13279.001.patch, HDFS-13279.002.patch, 
> HDFS-13279.003.patch, HDFS-13279.004.patch, HDFS-13279.005.patch
>
>
> In a Hadoop cluster, the number of nodes per rack can differ. For 
> example, with 50 datanodes in all and 15 datanodes per rack, 5 nodes 
> remain on the last rack. In this situation, we find that storage 
> usage on the last 5 nodes is much higher than on the other nodes.
>  With the default block placement policy, for each block, the first 
> replica has the same probability of being written to each datanode, but the 
> probability of the 2nd/3rd replica being written to the last 5 nodes is 
> much higher than for the other nodes. 
>  Consider writing 50 blocks to these 50 datanodes. The 1st replicas of the 
> 50 blocks are distributed equally across the 50 nodes. The 2nd replicas of 
> the blocks whose 1st replica is on rack1 (15 replicas) are sent equally to 
> the other 35 nodes, so each of those nodes receives 15/35 = 0.428 replicas. 
> The same holds for blocks on rack2 and rack3. As a result, each node on 
> rack4 (5 nodes) receives 1.29 replicas in all, while each other node 
> receives 0.97 replicas.
> ||-||Rack1(15 nodes)||Rack2(15 nodes)||Rack3(15 nodes)||Rack4(5 nodes)||
> |From rack1|-|15/35=0.43|0.43|0.43|
> |From rack2|0.43|-|0.43|0.43|
> |From rack3|0.43|0.43|-|0.43|
> |From rack4|5/45=0.11|0.11|0.11|-|
> |Total|0.97|0.97|0.97|1.29|
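
The totals in the table can be reproduced with a short calculation. The sketch below is illustrative only. It assumes one block is first written to every node and that each 2nd replica goes to a uniformly random node outside the writer's rack, as in the description above. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of the arithmetic in the table; not Hadoop code.
final class ReplicaLoadSketch {

    /**
     * Expected number of 2nd replicas received by a single node of each rack,
     * given one 1st replica per node and a uniform choice among off-rack nodes.
     */
    static double[] expectedSecondReplicasPerNode(int[] rackSizes) {
        int total = Arrays.stream(rackSizes).sum();
        double[] perNode = new double[rackSizes.length];
        for (int src = 0; src < rackSizes.length; src++) {
            // Blocks written on rack src are spread evenly over all nodes on other racks.
            double share = (double) rackSizes[src] / (total - rackSizes[src]);
            for (int dst = 0; dst < rackSizes.length; dst++) {
                if (dst != src) {
                    perNode[dst] += share;
                }
            }
        }
        return perNode;
    }

    public static void main(String[] args) {
        double[] perNode = expectedSecondReplicasPerNode(new int[] {15, 15, 15, 5});
        System.out.println(Arrays.toString(perNode));
    }
}
```

For rack sizes {15, 15, 15, 5} this gives roughly 0.97 for each node on the 15-node racks and roughly 1.29 for each node on the 5-node rack, matching the Total row.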



