[ 
https://issues.apache.org/jira/browse/HDFS-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149325#comment-13149325
 ] 

Uma Maheswara Rao G commented on HDFS-2537:
-------------------------------------------

*Dynamic Re-replication Analysis:*
      Considering an average block size of *64 MB*, producing replication work 
at the rate of 2*num_livenodes is not a low generation rate. The main concern 
should be how the blocks queued in each DataNode descriptor's replicateQueue 
are processed on heartbeats. Currently the consumption rate depends mainly on 
dfs.namenode.replication.max-streams. The default value is 2, so at any point 
in time only two threads on a DN will work on replication.
  Here is one simple solution to accelerate the work depending on DN load: 
         We should have the following config items (a small sketch of reading 
these keys follows):
*dfs.namenode.replication.max-streams* represents the maximum number of DataNode 
threads that may work on replication. For now, consider a default value of 20.
*dfs.namenode.replication.min-streams* represents the minimum number of DataNode 
threads that should work on replication. Consider a default value of 2.
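
For illustration only, the two keys might be read like this. Note that 
dfs.namenode.replication.min-streams is the new key proposed in this comment, 
and the class wrapper is hypothetical; only Configuration.getInt is a real 
Hadoop API.
{code:java}
import org.apache.hadoop.conf.Configuration;

/** Sketch: reading the proposed stream-count keys (illustrative only). */
class ReplicationStreamConfig {
  // max-streams exists today (current default 2); min-streams is proposed here.
  static int[] readStreams(Configuration conf) {
    int max = conf.getInt("dfs.namenode.replication.max-streams", 20);
    int min = conf.getInt("dfs.namenode.replication.min-streams", 2);
    return new int[] { min, max };
  }
}
{code}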
   For creating the DNA_TRANSFER command, we need to find the list of 
replication blocks from the corresponding DN descriptor. Here we first need to 
find the SRC DN capacity.
  Below are the conditions to find the capacity (see the sketch after this list):
1)      If the SRC DN dataXceiver count is at 20% of its limit, then 80% of 
dfs.namenode.replication.max-streams can work on replication.
2)      If the SRC DN dataXceiver count is at 0-5%, then 95% (keeping 5% as 
tolerance) of dfs.namenode.replication.max-streams should work on replication.
3)      If the SRC DN dataXceiver count is at 95% or above (keeping 5% as 
tolerance), then only dfs.namenode.replication.min-streams can work on replication.
4)      For any other dataXceiver percentage, the capacity is calculated 
proportionally, as in option 1.
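
A minimal sketch of that capacity calculation, assuming the xceiver load is 
expressed as a fraction of the DN's configured xceiver limit. The class, method, 
and parameter names are hypothetical, not actual HDFS APIs; only the two config 
keys and their proposed defaults come from the proposal above.
{code:java}
/** Sketch of the proposed SRC DN capacity calculation (illustrative only). */
public class ReplicationCapacity {
  // Proposed defaults from this comment.
  static final int MAX_STREAMS = 20;    // dfs.namenode.replication.max-streams
  static final int MIN_STREAMS = 2;     // dfs.namenode.replication.min-streams
  static final double TOLERANCE = 0.05; // 5% tolerance band

  /**
   * @param xceiverCount current dataXceiver count on the SRC DN
   * @param xceiverLimit configured dataXceiver limit on the DN
   * @return number of replication streams the SRC DN can take now
   */
  static int replicationCapacity(int xceiverCount, int xceiverLimit) {
    double load = (double) xceiverCount / xceiverLimit; // fraction in use
    if (load <= TOLERANCE) {
      // Condition 2: nearly idle, allow 95% of max-streams.
      return (int) (MAX_STREAMS * (1.0 - TOLERANCE));
    }
    if (load >= 1.0 - TOLERANCE) {
      // Condition 3: nearly saturated, fall back to the minimum.
      return MIN_STREAMS;
    }
    // Conditions 1 and 4: scale proportionally, e.g. 20% load -> 80% of max.
    int capacity = (int) (MAX_STREAMS * (1.0 - load));
    return Math.max(MIN_STREAMS, capacity);
  }
}
{code}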

 Once the SRC DN capacity is found, we need to find the blocks and targets to 
fill that capacity.
Targets should also have the capacity to work on the replication task. If any 
of a block's targets does not have the capacity, then that particular block 
should not be selected for this heartbeat. Collect all such unselected blocks 
and place them at the front of the queue so they are checked immediately on 
the next heartbeat. This way the replication processing work will move between 
*dfs.namenode.replication.min-streams* and 
*dfs.namenode.replication.max-streams* (a sketch of this loop follows).
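
A rough sketch of that selection loop, again with hypothetical types and 
helpers (BlockInfo and targetsHaveCapacity are placeholders, not HDFS APIs):
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative selection of replication work for one heartbeat. */
class ReplicationWorkSelector {
  /**
   * Pull up to {@code capacity} blocks off the DN's replicate queue,
   * deferring blocks whose targets are currently overloaded and
   * re-queuing them at the front for the next heartbeat.
   */
  List<BlockInfo> selectWork(Deque<BlockInfo> replicateQueue, int capacity) {
    List<BlockInfo> selected = new ArrayList<>();
    List<BlockInfo> deferred = new ArrayList<>();
    while (selected.size() < capacity && !replicateQueue.isEmpty()) {
      BlockInfo block = replicateQueue.poll();
      if (targetsHaveCapacity(block)) {
        selected.add(block);   // goes into the DNA_TRANSFER command
      } else {
        deferred.add(block);   // targets busy; retry next heartbeat
      }
    }
    // Re-add deferred blocks at the front, preserving their order,
    // so they are checked first on the next heartbeat.
    for (int i = deferred.size() - 1; i >= 0; i--) {
      replicateQueue.addFirst(deferred.get(i));
    }
    return selected;
  }

  boolean targetsHaveCapacity(BlockInfo block) {
    // Placeholder: would check each chosen target's xceiver load.
    return true;
  }
}

/** Placeholder block descriptor for the sketch. */
class BlockInfo {}
{code}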
Below is the drawback with this solution:
If the NN generates the work as early as possible, choosing the targets can 
consider only the current load on the DNs.
The problem here is that there is a high chance of choosing the same targets as 
were chosen for the previous block, because the previous targets' future work 
will not be reflected in the DN's dataXceiver count immediately. Work 
distribution will then not happen properly with respect to the actual load on 
the DNs.
This can be solved by tracking the estimated future work and considering that 
future load while choosing the next target for replication. The DN should also 
report the blocks that are in progress in its DataTransfer threads. Because 
DataTransfer threads write blocks with the WRITE op code, these writes also 
increment the xceiver count on the targets. Since the DN is already 
incrementing this count, the NN should decrement it from the tracked future 
load (a sketch follows).
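
A minimal sketch of that future-load bookkeeping on the NN side, assuming a 
hypothetical per-DN counter (none of these names are real HDFS APIs):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative future-load tracker. Scheduled work raises a target's
 * estimated load; once the DN reports the transfer as in-progress (its
 * xceiver count already reflects it), the estimate is decremented so the
 * same transfer is not counted twice.
 */
class FutureLoadTracker {
  private final Map<String, Integer> futureLoad = new ConcurrentHashMap<>();

  /** Called when the NN schedules a replication to this target. */
  void onWorkScheduled(String targetDn) {
    futureLoad.merge(targetDn, 1, Integer::sum);
  }

  /** Called when the DN reports the block as in-progress in a DataTransfer
      thread; the DN's report always follows the NN's scheduling, so an
      entry for this target should already exist. */
  void onTransferReported(String targetDn) {
    futureLoad.merge(targetDn, -1, Integer::sum);
  }

  /** Effective load = current xceiver count + estimated future work. */
  int effectiveLoad(String targetDn, int currentXceiverCount) {
    return currentXceiverCount + futureLoad.getOrDefault(targetDn, 0);
  }
}
{code}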

 This will not work in the *Federation* case, because one NN will not know 
about the other NNs' future replication load calculations for the DNs.
So the above problem still arises with Federation. 
Considering Federation, we still need to think more to find better solutions.

                
> re-replicating under replicated blocks should be more dynamic
> -------------------------------------------------------------
>
>                 Key: HDFS-2537
>                 URL: https://issues.apache.org/jira/browse/HDFS-2537
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.20.205.0, 0.23.0
>            Reporter: Nathan Roberts
>
> When a node fails or is decommissioned, a large number of blocks become 
> under-replicated. Since re-replication work is distributed, the hope would be 
> that all blocks could be restored to their desired replication factor in very 
> short order. This doesn't happen though because the load the cluster is 
> willing to devote to this activity is mostly static (controlled by 
> configuration variables). Since it's mostly static, the rate has to be set 
> conservatively to avoid overloading the cluster with replication work.
> This problem is especially noticeable when you have lots of small blocks. It 
> can take many hours to re-replicate the blocks that were on a node while the 
> cluster is mostly idle. 
