[ 
https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471803
 ] 

Hairong Kuang commented on HADOOP-923:
--------------------------------------

Two comments:
1. I do not think it is necessary to balance the number of transfers when the 
heartbeat thread picks up the replication work. First, the background thread 
that computes pendingTransfers has already balanced the load. Second, block 
replication work needs to be done as soon as possible to avoid data loss. 
Since the block replication work has already been assigned to the datanode, no 
other datanode can pick it up. If the work is not sent to the datanode in the 
current heartbeat, it has to wait for at least another heartbeat interval.

2. The background thread that computes pendingTransfers scans only 100 
datanodes per iteration and then sleeps for 3 seconds. I feel that this 
approach does not scale well. For example, when the cluster grows to 2000 
datanodes, a datanode's work gets computed only once every 2000/100*3 = 60 
seconds (1 min) if we ignore the computation overhead, which is far less 
frequent than what we do now (every 3 seconds). Another minor flaw is that the 
thread uses an index to record the next node to be checked. But if the 
heartbeat queue gets updated between two consecutive iterations, the index may 
no longer point to the right node.
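
To make the two concerns in point 2 concrete, here is a minimal standalone 
sketch. It is not taken from the patch: the class ScanCycleSketch, its methods 
and the node names dn0..dn5 are all made up for illustration. It reproduces 
the 2000/100*3 arithmetic and shows how a saved index can skip a node when the 
queue shrinks between two iterations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from the Hadoop patch.
public class ScanCycleSketch {

    // Seconds needed to visit every datanode once when the scanner handles
    // batchSize nodes and then sleeps sleepSeconds (computation cost ignored).
    static long fullCycleSeconds(int numDatanodes, int batchSize, int sleepSeconds) {
        long batches = (numDatanodes + batchSize - 1) / batchSize; // ceiling division
        return batches * sleepSeconds;
    }

    // Visits up to batchSize nodes starting at the saved index and returns them.
    static List<String> scanBatch(List<String> queue, int index, int batchSize) {
        List<String> visited = new ArrayList<>();
        for (int i = 0; i < batchSize && index + i < queue.size(); i++) {
            visited.add(queue.get(index + i));
        }
        return visited;
    }

    public static void main(String[] args) {
        // 2000 nodes, 100 per iteration, 3 second sleep: one full pass per minute.
        System.out.println(fullCycleSeconds(2000, 100, 3)); // 60

        List<String> queue = new ArrayList<>(
                Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4", "dn5"));
        int index = 0;
        System.out.println(scanBatch(queue, index, 2)); // [dn0, dn1]
        index += 2;

        // The queue is updated between iterations: dn0 is removed, so every
        // remaining node shifts down by one position.
        queue.remove("dn0");
        System.out.println(scanBatch(queue, index, 2)); // [dn3, dn4]; dn2 is skipped
    }
}
```

Remembering the last node visited rather than a position would be one way to 
avoid the skip, though that is only a suggestion and not something the patch 
is known to do.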

> DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of 
> other datanodes
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-923
>                 URL: https://issues.apache.org/jira/browse/HADOOP-923
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.10.1
>            Reporter: dhruba borthakur
>         Assigned To: dhruba borthakur
>         Attachments: pendingTransferThread.patch
>
>
> The datanode sends a heartbeat to the namenode every 3 seconds. The namenode 
> processes the heartbeat and sends a list of blocks-to-be-replicated and 
> blocks-to-be-deleted as part of the heartbeat response.
> At times when a couple of datanodes fail, the heartbeat processing on the 
> namenode becomes pretty heavyweight. It acquires the global FSNamesystem 
> lock, traverses the neededReplication structure, generates a list of blocks 
> to be replicated and responds to the heartbeat message. Determining the list 
> of blocks-to-be-replicated is pretty heavyweight, takes plenty of CPU and 
> blocks processing of other heartbeats because of the global FSNamesystem lock.
> It would improve scalability a lot if heartbeat processing does not require 
> the FSNamesystem lock. In fact, a separate "heartbeat" lock already 
> exists for this purpose. 
> I propose that the Heartbeat message be separate from the "retrieve 
> blocks-to-replicate and blocks-to-delete" messages. The datanode can continue 
> to heartbeat once every 3 seconds while it can afford to "retrieve 
> blocks-to-replicate" at a much coarser interval. Heartbeat processing on the 
> namenode will be fast because it does not require the global FSNamesystem 
> lock. Moreover, a datanode failure will not aggravate the heartbeat 
> processing time on the namenode.
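
The proposal quoted above can be sketched roughly as follows. This is an 
illustrative model only, not the namenode code: the class SplitHeartbeatSketch, 
its two lock objects, the method names and the 30 second retrieval interval 
are all assumptions. It shows heartbeats being recorded under a small 
dedicated lock, while the stand-in for the global FSNamesystem lock is taken 
only on the less frequent "retrieve blocks-to-replicate" call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposal; not the actual namenode implementation.
public class SplitHeartbeatSketch {
    private final Object heartbeatLock = new Object();  // cheap, dedicated lock
    private final Object namesystemLock = new Object(); // stands in for the global lock
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    int heartbeatsProcessed = 0;
    int globalLockAcquisitions = 0;

    // Fast path: only records that the datanode is alive.
    void heartbeat(String datanode, long nowSeconds) {
        synchronized (heartbeatLock) {
            lastHeartbeat.put(datanode, nowSeconds);
            heartbeatsProcessed++;
        }
    }

    // Slow path: the expensive replication/deletion computation would go here.
    void retrieveBlockWork(String datanode) {
        synchronized (namesystemLock) {
            globalLockAcquisitions++;
        }
    }

    // Simulates one datanode that heartbeats every heartbeatSec seconds and
    // asks for block work every workSec seconds (assumed interval).
    static SplitHeartbeatSketch simulate(int totalSec, int heartbeatSec, int workSec) {
        SplitHeartbeatSketch namenode = new SplitHeartbeatSketch();
        for (int t = 0; t < totalSec; t += heartbeatSec) {
            namenode.heartbeat("dn0", t);
            if (t % workSec == 0) {
                namenode.retrieveBlockWork("dn0");
            }
        }
        return namenode;
    }

    public static void main(String[] args) {
        SplitHeartbeatSketch nn = simulate(60, 3, 30);
        System.out.println(nn.heartbeatsProcessed);    // 20
        System.out.println(nn.globalLockAcquisitions); // 2
    }
}
```

In this model, one minute of traffic from a single datanode takes the global 
lock twice instead of twenty times, which is the effect the description is 
after.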

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
