Shangshu Qian created HDFS-17661:
------------------------------------

             Summary: BlockRecoveryWorker may have a contention with the BPServiceActor, causing missing IBRs
                 Key: HDFS-17661
                 URL: https://issues.apache.org/jira/browse/HDFS-17661
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Shangshu Qian

We found that heavy BlockRecoveryWorker activity can delay IncrementalBlockReports (IBRs) because of IOExceptions (IOEs). In some edge cases, the DataNode (DN) can enter a feedback loop.

The feedback loop can occur when the cluster is under high load:
# High load on the DN may trigger an IOE in IncrementalBlockReportManager.sendIBRs(). In the current implementation, the IBR is requeued and the IOE is swallowed. Assume some block deletions are delayed at this point.
# When the DataXceiver transfers a block, DataNode.transferReplicaForPipelineRecovery() can hit an IOE if it cannot retrieve the block locally. This can happen when the IBR containing the block deletion is delayed.
# The IOE in the write pipeline can trigger a pipeline rebuild, which slows down the client. The client's lease is then more likely to expire or to be taken over by another client. Either case can trigger a lease recovery, which includes a block recovery and puts more load on the DN.
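
The requeue-and-swallow behaviour in the first step can be illustrated with a minimal, self-contained sketch. This is a simplified model, not the Hadoop code: the class IbrRequeueSketch, the ReportSink interface, and the string-valued events are all hypothetical stand-ins for IncrementalBlockReportManager, the NameNode RPC, and the real report types. It only shows how a swallowed IOE leaves a deletion report pending across several send attempts.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a report queue that requeues on failure.
class IbrRequeueSketch {

    /** Hypothetical stand-in for the RPC call to the NameNode. */
    interface ReportSink {
        void send(List<String> batch) throws IOException;
    }

    // Pending report entries, oldest first.
    private final Deque<String> pending = new ArrayDeque<>();

    void enqueue(String event) {
        pending.addLast(event);
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Tries to send everything that is pending. On IOException the batch is
     * put back at the front of the queue (order preserved) and the exception
     * is swallowed, so the caller only sees a boolean.
     */
    boolean sendPending(ReportSink sink) {
        if (pending.isEmpty()) {
            return true;
        }
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        try {
            sink.send(batch);
            return true;
        } catch (IOException e) {
            // Requeue in reverse so the original order is restored.
            for (int i = batch.size() - 1; i >= 0; i--) {
                pending.addFirst(batch.get(i));
            }
            return false;
        }
    }

    public static void main(String[] args) {
        IbrRequeueSketch mgr = new IbrRequeueSketch();
        mgr.enqueue("DELETED blk_1");

        List<String> delivered = new ArrayList<>();
        int[] failuresLeft = {2}; // simulate two overloaded attempts

        ReportSink flaky = batch -> {
            if (failuresLeft[0] > 0) {
                failuresLeft[0]--;
                throw new IOException("simulated overload");
            }
            delivered.addAll(batch);
        };

        int attempts = 0;
        while (mgr.pendingCount() > 0) {
            mgr.sendPending(flaky);
            attempts++;
        }
        // prints: attempts=3 delivered=[DELETED blk_1]
        System.out.println("attempts=" + attempts + " delivered=" + delivered);
    }
}
```

In this model the deletion stays invisible to the receiver until the third attempt. The window in which another component could act on stale block state grows with every swallowed failure, which is the delay that the second step depends on.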