[
https://issues.apache.org/jira/browse/HDDS-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083613#comment-18083613
]
Wei-Chiu Chuang commented on HDDS-15350:
----------------------------------------
This cluster has a DNS resolution problem: a single DNS lookup can take up to
10 seconds to return. That appears to be the culprit.
I saw messages like the following in the SCM log:
{noformat}
2026-05-26 11:27:48,330 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler: A dead datanode is detected. bf0ebee8-d060-405c-9089-d9fbaf5d649b(ve1128.halxg.cloudera.com/10.17.246.38)
2026-05-26 11:27:48,344 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler: Clearing command queue of size 164 for DN bf0ebee8-d060-405c-9089-d9fbaf5d649b(ve1128.halxg.cloudera.com/10.17.246.38)
2026-05-26 11:27:48,344 INFO [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl: Removed a node: /default/bf0ebee8-d060-405c-9089-d9fbaf5d649b
{noformat}
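To illustrate how a "/ by zero" can arise in a per-rack limit calculation, here is a minimal sketch. It is not the actual Ozone source: the class name RackLimitSketch and both method names are hypothetical, and it assumes the limit is computed as a ceiling division of the replica count by the rack count. It also assumes, without confirmation from the logs above, that the rack count can reach zero once dead nodes are removed from the topology.

```java
// Hypothetical illustration only; this is NOT the Ozone implementation.
public final class RackLimitSketch {
    private RackLimitSketch() { }

    /**
     * Ceiling of numReplicas / numberOfRacks using integer arithmetic.
     * Throws ArithmeticException ("/ by zero") when numberOfRacks is 0.
     */
    public static int maxReplicasPerRackUnguarded(int numReplicas, int numberOfRacks) {
        return numReplicas / numberOfRacks
            + Math.min(numReplicas % numberOfRacks, 1);
    }

    /**
     * Same calculation, but an empty topology is treated as a single rack.
     * That fallback is an assumption made for this sketch; a real fix might
     * prefer to skip the container or report it differently.
     */
    public static int maxReplicasPerRackGuarded(int numReplicas, int numberOfRacks) {
        int racks = Math.max(numberOfRacks, 1);
        return numReplicas / racks + Math.min(numReplicas % racks, 1);
    }

    public static void main(String[] args) {
        // 5 replicas over 2 racks: at most 3 per rack.
        System.out.println(maxReplicasPerRackGuarded(5, 2));
        // 5 replicas over 0 racks: falls back to 1 rack, so 5.
        System.out.println(maxReplicasPerRackGuarded(5, 0));
        try {
            maxReplicasPerRackUnguarded(5, 0);
        } catch (ArithmeticException e) {
            // The unguarded variant fails the same way as the stack trace.
            System.out.println("unguarded: " + e.getMessage());
        }
    }
}
```

The guard only keeps the calculation from throwing. Whether an empty topology should count as one rack, or whether the check should be skipped entirely, is a design decision for the actual patch.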
> Divide by zero bug crashed SCM when decommissioning a datanode
> --------------------------------------------------------------
>
> Key: HDDS-15350
> URL: https://issues.apache.org/jira/browse/HDDS-15350
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Wei-Chiu Chuang
> Assignee: Siyao Meng
> Priority: Major
> Labels: pull-request-available
>
> Encountered an interesting bug:
>
> {noformat}
> 2026-05-22 16:23:39,439 ERROR [ReplicationMonitor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Exception in Replication Monitor Thread.
> java.lang.ArithmeticException: / by zero
> at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.getMaxReplicasPerRack(SCMCommonPlacementPolicy.java:419)
> at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.validateContainerPlacement(SCMCommonPlacementPolicy.java:466)
> at org.apache.hadoop.hdds.scm.container.replication.health.ECMisReplicationCheckHandler.getPlacementStatus(ECMisReplicationCheckHandler.java:138)
> at org.apache.hadoop.hdds.scm.container.replication.health.ECMisReplicationCheckHandler.checkMisReplication(ECMisReplicationCheckHandler.java:93)
> at org.apache.hadoop.hdds.scm.container.replication.health.ECMisReplicationCheckHandler.handle(ECMisReplicationCheckHandler.java:69)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:38)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.health.AbstractCheck.handleChain(AbstractCheck.java:40)
> at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processContainer(ReplicationManager.java:899)
> at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processContainer(ReplicationManager.java:872)
> at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processAll(ReplicationManager.java:399)
> at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.run(ReplicationManager.java:953)
> at java.lang.Thread.run(Thread.java:748)
> 2026-05-22 16:23:39,442 INFO [ReplicationMonitor]-org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.ArithmeticException: / by zero
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)