Rick Weber created HDFS-17392:
---------------------------------

             Summary: NameNode rolls frequently with "EC replicas to be deleted are not in the candidate" error
                 Key: HDFS-17392
                 URL: https://issues.apache.org/jira/browse/HDFS-17392
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.3.6
            Reporter: Rick Weber
Recently upgraded my clusters from Hadoop v3.3.4 to Hadoop v3.3.6 and noticed a lot of NameNode instability. After about an hour, the active NameNode shuts down and the "next" one takes over. Looking into the shutdown reasons, I'm seeing errors similar to:

{code:java}
2024-02-20 12:05:37,352 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 8 msecs. 6639943 blocks are left. 1 blocks were removed.
2024-02-20 12:05:37,352 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor thread received Runtime exception.
java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
        at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancyStriped(BlockManager.java:4082)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancies(BlockManager.java:3970)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processExtraRedundancyBlock(BlockManager.java:3957)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlock(BlockManager.java:3898)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.rescanPostponedMisreplicatedBlocks(BlockManager.java:2898)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5053)
        at java.lang.Thread.run(Thread.java:750)
2024-02-20 12:05:37,357 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
{code}

Looking through the code path itself, there is a `Preconditions.checkArgument()` check to ensure that a given block chosen for deletion is actually one of the valid candidate blocks. If it is not, the NameNode shuts down.
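The failure mode boils down to a Guava-style `Preconditions.checkArgument()` whose `IllegalArgumentException` propagates out of the RedundancyMonitor thread and is treated as fatal. A minimal self-contained sketch of that invariant (class and method names here are illustrative, not the actual BlockManager code; the helper mirrors Guava's `checkArgument` so no Guava dependency is needed):

{code:java}
import java.util.Set;

public class ExcessRedundancyCheckSketch {

    // Mirrors Guava's Preconditions.checkArgument: throws
    // IllegalArgumentException when the condition is false.
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Hypothetical stand-in for chooseExcessRedundancyStriped: every replica
    // picked for deletion must come from the candidate set, or the check trips.
    static void deleteExcessReplica(Set<String> candidates, String replicaToDelete) {
        checkArgument(candidates.contains(replicaToDelete),
            "The EC replicas to be deleted are not in the candidate list");
        // ... deletion bookkeeping would happen here ...
    }

    public static void main(String[] args) {
        Set<String> candidates = Set.of("dn1:blk_1", "dn2:blk_2");

        // Valid choice: replica is in the candidate set, no exception.
        deleteExcessReplica(candidates, "dn1:blk_1");

        // Invalid choice: replica outside the candidates throws
        // IllegalArgumentException, which the RedundancyMonitor treats
        // as fatal and the NameNode exits with status 1.
        try {
            deleteExcessReplica(candidates, "dn3:blk_3");
        } catch (IllegalArgumentException e) {
            System.out.println("check tripped: " + e.getMessage());
        }
    }
}
{code}

This is only meant to show why a single mis-chosen replica is enough to take down the active NameNode: the precondition is enforced, not merely logged.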
This is likely a symptom of a larger issue, namely: how is a block that is not in the candidate list being chosen for deletion in the first place? The rest of the cluster has services such as SPS and the Balancer disabled, so the only data movement should be whatever is "organically" chosen by the NameNode.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)