Siddhant Sangwan created HDDS-8459:
--------------------------------------
Summary: Ratis under replication handling in a rack aware
environment doesn't work
Key: HDDS-8459
URL: https://issues.apache.org/jira/browse/HDDS-8459
Project: Apache Ozone
Issue Type: Sub-task
Components: SCM
Reporter: Siddhant Sangwan
Assignee: Siddhant Sangwan
The environment is the rack-aware one defined in
{{dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone-topology}}. I additionally
added the following configurations to enable the new ReplicationManager and
ContainerScanner. The ContainerBalancer configurations shouldn't be relevant
here.
{code:java}
OZONE-SITE.XML_hdds.scm.replication.enable.legacy=false
OZONE-SITE.XML_hdds.container.balancer.balancing.iteration.interval=5m
OZONE-SITE.XML_hdds.container.balancer.move.timeout=295s
OZONE-SITE.XML_hdds.container.balancer.move.replication.timeout=200s
OZONE-SITE.XML_hdds.scm.replication.thread.interval=100s
OZONE-SITE.XML_hdds.container.scrub.enabled=true
OZONE-SITE.XML_hdds.container.scrub.metadata.scan.interval=20s
OZONE-SITE.XML_hdds.container.scrub.data.scan.interval=20s
{code}
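For readers unfamiliar with the compose files: each line above appears to follow a FILE_key=value convention, where the prefix names the target configuration file and the rest is a property of that file. The sketch below splits one such line under that assumption. The class and method names are hypothetical, and this is not the tooling Ozone itself uses.

```java
import java.util.Locale;

// Sketch only: splits a compose-style "FILE_key=value" line into its parts.
final class ComposeConfigSketch {

    // Returns {file, key, value}. Assumes the first '_' ends the file prefix
    // and the first '=' after it separates the key from the value.
    static String[] parse(String line) {
        int underscore = line.indexOf('_');
        int equals = line.indexOf('=', underscore + 1);
        if (underscore < 0 || equals < 0) {
            throw new IllegalArgumentException("not a FILE_key=value line: " + line);
        }
        String file = line.substring(0, underscore).toLowerCase(Locale.ROOT);
        String key = line.substring(underscore + 1, equals);
        String value = line.substring(equals + 1);
        return new String[] {file, key, value};
    }

    public static void main(String[] args) {
        String[] parts =
            parse("OZONE-SITE.XML_hdds.scm.replication.enable.legacy=false");
        System.out.println(parts[0] + ": " + parts[1] + " = " + parts[2]);
    }
}
```

Read this way, the first setting is the property hdds.scm.replication.enable.legacy set to false, which is what switches SCM to the new ReplicationManager.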
When I manually change the checksum of a container replica on a DN, the
container scanner detects the corruption and marks the replica UNHEALTHY.
However, RM is not able to handle the resulting under-replicated container:
{code:java}
scm_1 | 2023-04-19 07:49:20 ERROR UnhealthyReplicationProcessor:98 - Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#10, state=CLOSED, pipelineID=PipelineID=4c45f6ae-ff08-4890-8f8a-9cb12cd16283, stateEnterTime=2023-04-19T07:31:55.110Z, owner=om1}
scm_1 | java.lang.IllegalArgumentException
scm_1 | at com.google.common.base.Preconditions.checkArgument(Preconditions.java:131)
scm_1 | at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodesInternal(SCMContainerPlacementRackAware.java:107)
scm_1 | at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
scm_1 | at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:238)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:110)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:819)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:53)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:27)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:127)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:136)
scm_1 | at java.base/java.lang.Thread.run(Thread.java:829)
scm_1 | 2023-04-19 07:49:20 INFO UnhealthyReplicationProcessor:110 - Processed 0 containers with health state counts {}, failed processing 1
{code}
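Note that the exception in the trace carries no message. That is consistent with Guava's single-argument Preconditions.checkArgument(boolean), which throws a bare IllegalArgumentException when the expression is false. The sketch below reproduces that shape with a stand-in helper so it runs without Guava. The condition it checks (a positive node count) is only an assumption for illustration; the actual check at SCMContainerPlacementRackAware.java:107 is not shown in this report.

```java
// Sketch only: mimics the message-less argument check seen in the trace.
final class PreconditionSketch {

    // Stand-in for Guava's Preconditions.checkArgument(boolean): throws an
    // IllegalArgumentException without a message when the expression is false.
    static void checkArgument(boolean expression) {
        if (!expression) {
            throw new IllegalArgumentException();
        }
    }

    // Hypothetical placement entry point. The condition is assumed, not taken
    // from the Ozone source.
    static int chooseTargets(int nodesRequired) {
        checkArgument(nodesRequired > 0);
        return nodesRequired;
    }

    public static void main(String[] args) {
        try {
            chooseTargets(0);
        } catch (IllegalArgumentException e) {
            // getMessage() is null here, so a logger can print only the
            // exception class and the stack, as in the SCM log.
            System.out.println("caught " + e + ", message=" + e.getMessage());
        }
    }
}
```

Because the failing check has no message, the log alone does not say which argument was rejected; the value passed down from RatisUnderReplicationHandler.getTargets would have to be inspected to find out.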