[ 
https://issues.apache.org/jira/browse/HDDS-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083613#comment-18083613
 ] 

Wei-Chiu Chuang commented on HDDS-15350:
----------------------------------------

This cluster has a DNS resolution problem: a single DNS lookup can take up to 
10 seconds to return. That appears to be the culprit.
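
To check lookup latency from the SCM host, a minimal Java sketch like the one below can time a single resolution. The class and method names are hypothetical and not part of Ozone; the only library call used is the standard InetAddress.getByName.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of Apache Ozone.
final class DnsLookupTimer {
    private DnsLookupTimer() { }

    /**
     * Returns the wall-clock time, in milliseconds, spent resolving the host.
     * Failed lookups are timed as well, since a slow failure is still slow.
     */
    static long timeLookupMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            // Ignored on purpose: only the elapsed time matters here.
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // The JVM caches successful lookups, so only the first call
        // for a given host in a process is representative.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " resolved in " + timeLookupMillis(host) + " ms");
    }
}
```

Passing a datanode hostname as the argument on the SCM host should show whether resolution is in the multi-second range.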

I saw messages like the following in the SCM log:

{noformat}
2026-05-26 11:27:48,330 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler: A dead datanode is detected. bf0ebee8-d060-405c-9089-d9fbaf5d649b(ve1128.halxg.cloudera.com/10.17.246.38)
2026-05-26 11:27:48,344 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler: Clearing command queue of size 164 for DN bf0ebee8-d060-405c-9089-d9fbaf5d649b(ve1128.halxg.cloudera.com/10.17.246.38)
2026-05-26 11:27:48,344 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl: Removed a node: /default/bf0ebee8-d060-405c-9089-d9fbaf5d649b
{noformat}
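
The stack trace in the description only shows that a division inside getMaxReplicasPerRack hit zero; it does not show the expression itself. Assuming, purely for illustration, that the divisor is a rack count that can drop to zero once nodes are removed from the topology, a guard could look roughly like the sketch below. The class, method, and parameter names are hypothetical, and this is not the actual SCMCommonPlacementPolicy code or the fix in the pull request.

```java
// Hypothetical sketch; not the actual SCMCommonPlacementPolicy implementation.
final class RackLimitSketch {
    private RackLimitSketch() { }

    /**
     * Ceiling of replicas / racks, guarded against an empty topology.
     *
     * @param replicas number of replicas to place (assumed non-negative)
     * @param racks    number of racks currently known (assumed non-negative)
     */
    static int maxReplicasPerRack(int replicas, int racks) {
        if (replicas < 0 || racks < 0) {
            throw new IllegalArgumentException("replicas and racks must be >= 0");
        }
        if (racks == 0) {
            // Assumed fallback: with no racks known, impose no per-rack limit
            // instead of letting integer division throw ArithmeticException.
            return replicas;
        }
        // Integer ceiling division without risking overflow.
        return replicas / racks + (replicas % racks == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        System.out.println(maxReplicasPerRack(5, 3)); // prints 2
        System.out.println(maxReplicasPerRack(5, 0)); // prints 5, no exception
    }
}
```

Whether returning the replica count is the right fallback is a design choice; skipping the container until the topology recovers would be another option.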

> Divide by zero bug crashed SCM when decommissioning a datanode
> --------------------------------------------------------------
>
>                 Key: HDDS-15350
>                 URL: https://issues.apache.org/jira/browse/HDDS-15350
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Wei-Chiu Chuang
>            Assignee: Siyao Meng
>            Priority: Major
>              Labels: pull-request-available
>
> Encountered an interesting bug:
>  
> {noformat}
> 2026-05-22 16:23:39,439 ERROR [ReplicationMonitor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Exception in Replication Monitor Thread.
> java.lang.ArithmeticException: / by zero
>         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.getMaxReplicasPerRack(SCMCommonPlacementPolicy.java:419)
>         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.validateContainerPlacement(SCMCommonPlacementPolicy.java:466)
>         at org.apache.hadoop.hdds.scm.container.replication.health.ECMisReplicationCheckHandler.getPlacementStatus(ECMisReplicationCheckHandler.java:138)
>         at org.apache.hadoop.hdds.scm.container.replication.health.ECMisReplicationCheckHandler.checkMisReplication(ECMisReplicationCheckHandler.java:93)
>         at org.apache.hadoop.hdds.scm.container.replication.health.ECMisReplicationCheckHandler.handle(ECMisReplicationCheckHandler.java:69)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:38)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
>         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processContainer(ReplicationManager.java:899)
>         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processContainer(ReplicationManager.java:872)
>         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processAll(ReplicationManager.java:399)
>         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.run(ReplicationManager.java:953)
>         at java.lang.Thread.run(Thread.java:748)
> 2026-05-22 16:23:39,442 INFO [ReplicationMonitor]-org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.ArithmeticException: / by zero
> {noformat}



