Shangshu Qian created HDFS-17782:
------------------------------------

             Summary: The implementation of LowRedundancyBlocks can cause 
unexpected lock contentions, resulting in DN timeout
                 Key: HDFS-17782
                 URL: https://issues.apache.org/jira/browse/HDFS-17782
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, namenode
    Affects Versions: 3.4.1
            Reporter: Shangshu Qian


The current implementation of LowRedundancyBlocks relies on many 
synchronized methods. The main user of this class, `neededReconstruction` in 
BlockManager, frequently invokes those synchronized methods. A feedback loop can 
occur when the synchronized methods cause lock contention.
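
To illustrate why synchronized methods contend with each other, here is a minimal, self-contained sketch. It is not the Hadoop code: `SharedQueues`, `slowScan`, and `readerBlocksBehindScan` are hypothetical names made up for this example. It only shows the general Java behavior that all synchronized instance methods of one object share a single monitor, so a cheap call such as `size()` has to wait while any other synchronized method is running.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

public class SynchronizedContentionSketch {

    /** Hypothetical stand-in for a class whose public methods are all synchronized. */
    static class SharedQueues {
        private final Set<Long> blocks = new HashSet<>();

        synchronized boolean add(long blockId) { return blocks.add(blockId); }

        synchronized boolean remove(long blockId) { return blocks.remove(blockId); }

        synchronized int size() { return blocks.size(); }

        // Simulates a long-running operation: it keeps the monitor until 'release' fires.
        synchronized void slowScan(CountDownLatch entered, CountDownLatch release)
                throws InterruptedException {
            entered.countDown();
            release.await();
        }
    }

    /** Returns true if a thread calling size() was seen BLOCKED while slowScan() held the monitor. */
    static boolean readerBlocksBehindScan() {
        SharedQueues queues = new SharedQueues();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            Thread scanner = new Thread(() -> {
                try {
                    queues.slowScan(entered, release);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            scanner.start();
            entered.await(); // the scanner thread now owns the monitor of 'queues'

            Thread reader = new Thread(queues::size); // cheap call, but same monitor
            reader.start();

            // Poll the reader's state for up to 5 seconds.
            boolean blocked = false;
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (System.nanoTime() < deadline) {
                if (reader.getState() == Thread.State.BLOCKED) {
                    blocked = true;
                    break;
                }
                Thread.sleep(1);
            }

            release.countDown(); // let the scan finish so the reader can proceed
            scanner.join();
            reader.join();
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reader blocked behind scan: " + readerBlocksBehindScan());
    }
}
```

In this sketch the reader does no real work, yet it cannot make progress until the unrelated scan releases the monitor. If the callers are RPC handler threads, each blocked call also keeps one handler occupied for the duration of the wait.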

The feedback loop looks like this:
 # The cluster experiences a burst in IO. The BlockManager experiences lock 
contention on LowRedundancyBlocks.
 # Due to the lock contention, many of the RPC operations in the BlockManager 
get delayed, occupying the RPC pool for a long time.
 # The heartbeats from the DNs get delayed due to the contention. We start to 
lose DNs and the blocks on them.
 # We need to replicate those missing blocks, causing even higher load on the 
DNs, as well as on the block reports they send to the NN.
 # Those block reports interact with BlockManager and eventually 
LowRedundancyBlocks, making the lock contention problem worse.
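
The amplification in the steps above can be sketched with a toy model. This is only an illustration of the loop's shape, not a measurement: every constant below (`CAPACITY_PER_ROUND`, `BASE_LOAD`, `OVERFLOW_PER_LOST_DN`, `OPS_PER_LOST_DN`) is an invented, hypothetical value, and `simulate` is a made-up helper. The model assumes that operations which cannot be served in one round spill over, that spilled work delays heartbeats until some DNs are declared lost, and that each lost DN generates extra reconstruction and block report work.

```java
public class FeedbackLoopSketch {
    // All constants are hypothetical and chosen only to make the loop visible.
    static final long CAPACITY_PER_ROUND = 1000;  // ops the contended lock can serve per round
    static final long BASE_LOAD = 800;            // steady-state ops arriving every round
    static final long OVERFLOW_PER_LOST_DN = 250; // unserved ops assumed to cost one DN heartbeat timeout
    static final long OPS_PER_LOST_DN = 1000;     // extra ops triggered by one lost DN

    /** Returns the unserved backlog at the end of each round, starting from an initial burst. */
    static long[] simulate(long initialOps, int rounds) {
        long[] backlog = new long[rounds];
        long ops = initialOps;
        for (int i = 0; i < rounds; i++) {
            // Steps 1-2: work beyond the lock's capacity waits, holding RPC handlers.
            long overflow = Math.max(0, ops - CAPACITY_PER_ROUND);
            // Step 3: delayed heartbeats cause DNs to be declared lost.
            long lostDataNodes = overflow / OVERFLOW_PER_LOST_DN;
            backlog[i] = overflow;
            // Steps 4-5: lost DNs add reconstruction and block report work to the next round.
            ops = BASE_LOAD + overflow + lostDataNodes * OPS_PER_LOST_DN;
        }
        return backlog;
    }

    public static void main(String[] args) {
        for (long burst : new long[] {900, 1500}) {
            StringBuilder line = new StringBuilder("burst=" + burst + " backlog:");
            for (long b : simulate(burst, 5)) {
                line.append(' ').append(b);
            }
            System.out.println(line);
        }
    }
}
```

Under these assumed numbers, a burst that stays within capacity leaves no backlog, while a burst that exceeds it produces a backlog that grows every round instead of draining, which is the self-reinforcing behavior described in the list.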

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
