Yuanbo Liu created HDFS-16657:
---------------------------------

             Summary: Changing pool-level lock to volume-level lock for invalidation of blocks
                 Key: HDFS-16657
                 URL: https://issues.apache.org/jira/browse/HDFS-16657
             Project: Hadoop HDFS
          Issue Type: Sub-task
            Reporter: Yuanbo Liu
         Attachments: image-2022-07-13-10-25-37-383.png, image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png

Recently we have seen DataNode (dn) heartbeats become slow in a very busy cluster. Here is the chart:

!image-2022-07-13-10-25-37-383.png!

After taking a jstack of the dn, we found that the heartbeat is stuck in block invalidation:

!image-2022-07-13-10-27-01-386.png!

!image-2022-07-13-10-27-44-258.png!

The key code is:
{code:java}
// Runs for each replica in invalidBlks while the pool-level lock is held.
try {
  File blockFile = new File(info.getBlockURI());
  if (blockFile != null && blockFile.getParentFile() == null) {
    errors.add("Failed to delete replica " + invalidBlks[i]
        + ". Parent not found for block file: " + blockFile);
    continue;
  }
} catch(IllegalArgumentException e) {
  LOG.warn("Parent directory check failed; replica " + info
      + " is not backed by a local file");
}
{code}

The dn is trying to locate the parent path of the block file, so there is disk I/O inside the pool-level lock. When the disk becomes very busy with high I/O wait, all pending threads are blocked on the pool-level lock, and heartbeat time becomes high.

We propose changing the pool-level lock to a volume-level lock for block invalidation.

cc: [~hexiaoqiao] [~Aiphag0]
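
To illustrate the intent of the proposal, here is a minimal sketch of the lock-granularity idea. It is not the actual patch and does not use the DataNode's real lock classes: {{VolumeLockSketch}}, {{lockFor}}, {{invalidate}} and {{probe}} are hypothetical names, and the block pool id and volume paths are made up. It assumes one lock per (block pool, volume) pair, so a thread that is stuck on one busy disk only holds that volume's lock.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names are Hadoop classes.
final class VolumeLockSketch {
    // One lock per (block pool, volume) pair instead of one per block pool.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    ReentrantLock lockFor(String bpid, String volume) {
        return locks.computeIfAbsent(bpid + "|" + volume, k -> new ReentrantLock());
    }

    // Stand-in for the parent-directory check, run under the volume's lock only.
    List<String> invalidate(String bpid, String volume, List<String> blockFiles) {
        List<String> errors = new ArrayList<>();
        ReentrantLock lock = lockFor(bpid, volume);
        lock.lock();
        try {
            for (String path : blockFiles) {
                if (new File(path).getParentFile() == null) {
                    errors.add("Parent not found for block file: " + path);
                }
            }
        } finally {
            lock.unlock();
        }
        return errors;
    }

    // True if the calling thread could take the lock right now.
    private static boolean isFree(ReentrantLock lock) {
        if (!lock.tryLock()) {
            return false;
        }
        lock.unlock();
        return true;
    }

    // Holds busyVolume's lock on the calling thread (as if stuck in slow I/O),
    // then asks a second thread which locks it could take.
    // Returns {busyVolumeFree, otherVolumeFree}.
    boolean[] probe(String bpid, String busyVolume, String otherVolume) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        ReentrantLock busy = lockFor(bpid, busyVolume);
        busy.lock();
        try {
            Callable<boolean[]> task = () -> new boolean[] {
                isFree(lockFor(bpid, busyVolume)),
                isFree(lockFor(bpid, otherVolume))
            };
            return worker.submit(task).get();
        } finally {
            busy.unlock();
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        VolumeLockSketch s = new VolumeLockSketch();
        boolean[] free = s.probe("BP-1", "/data1", "/data2");
        System.out.println("busy volume free: " + free[0] + ", other volume free: " + free[1]);
        System.out.println(s.invalidate("BP-1", "/data2",
            Arrays.asList("/data2/current/blk_1", "blk_2")));
    }
}
```

In this sketch, while {{/data1}} is held by a slow thread, a second thread cannot take the {{/data1}} lock but can still take the {{/data2}} lock. With a single pool-level lock both attempts would have to wait.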