Rick Weber created HDFS-17392:
---------------------------------

             Summary: NameNode rolls frequently with "EC replicas to be deleted 
are not in the candidate" error
                 Key: HDFS-17392
                 URL: https://issues.apache.org/jira/browse/HDFS-17392
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.3.6
            Reporter: Rick Weber


I recently upgraded my clusters from Hadoop v3.3.4 to Hadoop v3.3.6 and have 
since noticed a lot of NameNode instability.  After about 1 hour, the active 
NameNode shuts down and the "next" one takes over.

Looking into the shutdown reasons, I'm seeing errors similar to the following:
{code:java}
2024-02-20 12:05:37,352 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 8 msecs. 6639943 blocks are left. 1 blocks were removed.
2024-02-20 12:05:37,352 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor thread received Runtime exception.
java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
    at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancyStriped(BlockManager.java:4082)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancies(BlockManager.java:3970)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processExtraRedundancyBlock(BlockManager.java:3957)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlock(BlockManager.java:3898)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.rescanPostponedMisreplicatedBlocks(BlockManager.java:2898)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5053)
    at java.lang.Thread.run(Thread.java:750)
2024-02-20 12:05:37,357 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
{code}
Looking through the code path itself, there is a 
`Preconditions.checkArgument()` call that verifies that a block chosen for 
deletion is actually in the candidate list.  If it is not, the check throws 
and the NN shuts down.
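
To illustrate the failure mode, here is a minimal, self-contained sketch of how a failed argument check inside a monitor loop can turn into a process exit. It is not the actual BlockManager code: the class name, method names, and datanode labels below are hypothetical, and the local `checkArgument` helper only stands in for the Guava method seen in the stack trace. Instead of really exiting, the sketch returns the status code a daemon would exit with.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; all names here are hypothetical, not HDFS APIs. */
final class ExcessReplicaCheckSketch {

    /** Local stand-in for Guava's Preconditions.checkArgument(boolean, Object). */
    static void checkArgument(boolean expression, String message) {
        if (!expression) {
            throw new IllegalArgumentException(message);
        }
    }

    /** Confirms that the replica picked for deletion is one of the candidates. */
    static String confirmDeletion(Set<String> candidates, String chosen) {
        checkArgument(candidates.contains(chosen),
            "The EC replicas to be deleted are not in the candidate list");
        return chosen;
    }

    /**
     * Simulates one pass of a monitor loop. Returns 0 on success, or 1 where
     * a daemon would terminate because of an unchecked exception.
     */
    static int runMonitorPass(Set<String> candidates, List<String> chosenReplicas) {
        try {
            for (String chosen : chosenReplicas) {
                confirmDeletion(candidates, chosen);
            }
            return 0;
        } catch (RuntimeException e) {
            // A single bad choice aborts the whole pass, not just one block.
            System.err.println("Monitor received runtime exception: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        Set<String> candidates = new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
        // Every chosen replica is a candidate: the pass completes.
        System.out.println(runMonitorPass(candidates, Arrays.asList("dn2")));
        // "dn9" is not a candidate: the check throws and the pass fails.
        System.out.println(runMonitorPass(candidates, Arrays.asList("dn2", "dn9")));
    }
}
```

The point of the sketch is that the check treats a mismatch as a programming error rather than a recoverable condition, so one inconsistent block is enough to take down the whole monitor.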

This is likely a symptom of a larger issue: how is a block that is not in the 
candidate list being chosen in the first place?

The remainder of the cluster has services such as SPS and the Balancer 
disabled, so the only movement of data should be whatever is "organically" 
chosen by the NameNode.



