Shangshu Qian created HDFS-17661:
------------------------------------

             Summary: BlockRecoveryWorker may have a contention with the 
BPServiceActor, causing missing IBRs
                 Key: HDFS-17661
                 URL: https://issues.apache.org/jira/browse/HDFS-17661
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Shangshu Qian


We found that a large number of BlockRecoveryWorker threads can cause 
IncrementalBlockReports (IBRs) to be delayed due to IOExceptions. In some 
edge cases, the DataNode may enter a feedback loop.

The feedback loop can occur when the cluster is under high load:
 # High load on the DN may trigger an IOException (IOE) in 
IncrementalBlockReportManager.sendIBRs(). In the current implementation, the 
IBR is requeued and the IOE is swallowed. Assume the reports of some block 
deletions are delayed at this point.
 # When the DataXceiver transfers a block, 
DataNode.transferReplicaForPipelineRecovery() can hit an IOE because it cannot 
find the block locally. This can happen when the IBR containing the block 
deletion is delayed.
 # The IOE in the write pipeline can trigger a pipeline rebuild, which slows 
down the client. The client's lease is then more likely to expire or to be 
taken over by another client. Either case can trigger a lease recovery, which 
includes a block recovery and puts more load on the DN.
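
The requeue-and-swallow behavior in step 1 can be illustrated with a small, 
self-contained model. This is only a sketch, not the actual 
IncrementalBlockReportManager code: the class IbrRequeueSketch, the 
ReportSender interface, and the block names are hypothetical and exist only 
for illustration. It shows how a failed send leaves the reports pending 
without the caller ever seeing the exception.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified model of requeue-and-swallow; NOT the real Hadoop implementation. */
final class IbrRequeueSketch {

    /** Hypothetical stand-in for the report RPC to the NameNode. */
    interface ReportSender {
        void send(List<String> reports) throws IOException;
    }

    // Pending report entries, oldest first (strings stand in for real IBR entries).
    private final Deque<String> pending = new ArrayDeque<>();

    void enqueue(String blockEvent) {
        pending.addLast(blockEvent);
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Tries to send everything that is pending. On IOException the batch is
     * put back in its original order and the exception is swallowed, so the
     * only visible effect of a failure is that the reports arrive later.
     */
    boolean sendPending(ReportSender sender) {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        if (batch.isEmpty()) {
            return true;
        }
        try {
            sender.send(batch);
            return true;
        } catch (IOException e) {
            // Requeue at the front, iterating backwards to preserve order.
            for (int i = batch.size() - 1; i >= 0; i--) {
                pending.addFirst(batch.get(i));
            }
            return false;
        }
    }

    public static void main(String[] args) {
        IbrRequeueSketch mgr = new IbrRequeueSketch();
        mgr.enqueue("DELETED blk_1");   // hypothetical block names
        mgr.enqueue("RECEIVED blk_2");

        // First attempt fails, as it might under high load.
        boolean ok = mgr.sendPending(batch -> {
            throw new IOException("simulated overload");
        });
        System.out.println("delivered=" + ok + ", pending=" + mgr.pendingCount());

        // A later attempt succeeds and drains the queue.
        ok = mgr.sendPending(batch -> { });
        System.out.println("delivered=" + ok + ", pending=" + mgr.pendingCount());
    }
}
```

In this model the deletion report for blk_1 stays queued between the two 
attempts, which corresponds to the window in which step 2 can fail.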

 



