Shangshu Qian created HDFS-17782:
------------------------------------

             Summary: The implementation of LowRedundancyBlocks can cause unexpected lock contentions, resulting in DN timeout
                 Key: HDFS-17782
                 URL: https://issues.apache.org/jira/browse/HDFS-17782
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, namenode
    Affects Versions: 3.4.1
            Reporter: Shangshu Qian

The current implementation of LowRedundancyBlocks relies on many synchronized methods. The main user of this class, `neededReconstruction` of BlockManager, frequently invokes those synchronized methods. A feedback loop can occur when the synchronized methods cause lock contention. The feedback loop looks like this:

# The cluster experiences a burst of IO. The BlockManager experiences lock contention on LowRedundancyBlocks.
# Due to the lock contention, many of the RPC operations in the BlockManager are delayed, occupying the RPC pool for a long time.
# Heartbeats from the DNs are delayed by the contention. We start to lose DNs and the blocks on them.
# We need to replicate those missing blocks, which puts even higher load on the DNs and increases the block reports they send to the NN.
# Those block reports interact with BlockManager and eventually LowRedundancyBlocks, making the lock contention problem worse.
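
To illustrate the pattern described above, here is a minimal Java sketch of a queue whose methods are all `synchronized`. It is not the real LowRedundancyBlocks code: the class `LockContentionSketch`, its methods, and the counters are hypothetical names invented for this example. It only shows that every caller serializes on a single monitor, so a burst of concurrent callers ends up waiting on each other.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a class guarded by synchronized methods.
// This is NOT the Hadoop LowRedundancyBlocks implementation.
class LockContentionSketch {
    private final Deque<Long> queue = new ArrayDeque<>();
    // Instrumentation: how many threads are inside a synchronized method now,
    // and the highest such count ever observed.
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    private void enter() {
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max);
    }

    private void exit() {
        inside.decrementAndGet();
    }

    // Every call takes the same monitor (this), so callers run one at a time.
    synchronized void add(long blockId) {
        enter();
        try {
            queue.addLast(blockId);
        } finally {
            exit();
        }
    }

    // Returns the oldest queued id, or null if the queue is empty.
    synchronized Long poll() {
        enter();
        try {
            return queue.pollFirst();
        } finally {
            exit();
        }
    }

    synchronized int size() {
        return queue.size();
    }

    // Starts `threads` workers that each add `addsPerThread` ids.
    // Returns {final queue size, max threads seen inside the monitor}.
    static int[] run(int threads, int addsPerThread) {
        LockContentionSketch q = new LockContentionSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final long base = (long) t * addsPerThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    q.add(base + i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { q.size(), q.maxInside.get() };
    }

    public static void main(String[] args) {
        int[] r = run(8, 10_000);
        System.out.println("size=" + r[0] + " maxThreadsInside=" + r[1]);
    }
}
```

However many worker threads are started, the recorded maximum number of threads inside the guarded methods stays at 1. That is the serialization point that, per the description above, can back up RPC handlers under a burst of load. The sketch does not propose a fix; it only makes the contention point visible.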