[jira] [Updated] (HDDS-8459) Ratis under replication handling in a rack aware environment doesn't work

Siddhant Sangwan (Jira) Wed, 19 Apr 2023 05:03:06 -0700


     [ https://issues.apache.org/jira/browse/HDDS-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Siddhant Sangwan updated HDDS-8459:
-----------------------------------
    Description: 
This is the rack-aware environment defined in 
{{dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone-topology}}. In addition, I 
added the following configurations to enable the new ReplicationManager and 
ContainerScanner. The ContainerBalancer configurations shouldn't be relevant 
here.
{code:java}
OZONE-SITE.XML_hdds.scm.replication.enable.legacy=false
OZONE-SITE.XML_hdds.container.balancer.balancing.iteration.interval=5m
OZONE-SITE.XML_hdds.container.balancer.move.timeout=295s
OZONE-SITE.XML_hdds.container.balancer.move.replication.timeout=200s
OZONE-SITE.XML_hdds.scm.replication.thread.interval=100s
OZONE-SITE.XML_hdds.container.scrub.enabled=true
OZONE-SITE.XML_hdds.container.scrub.metadata.scan.interval=20s
OZONE-SITE.XML_hdds.container.scrub.data.scan.interval=20s
{code}
When I manually change the checksum of a container replica on a DN, the 
container scanner detects this and marks the replica UNHEALTHY. However, RM 
is not able to handle the resulting under-replicated container.
EDIT: The stack trace looks slightly different on the latest Apache master and 
is more helpful:
{code}
scm_1         | 2023-04-19 12:00:09,485 [Under Replicated Processor] ERROR replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#2, state=CLOSED, pipelineID=PipelineID=c273b63f-0d6d-4701-b333-c8bcf3e85ba6, stateEnterTime=2023-04-19T11:55:13.697Z, owner=om1}
scm_1         | org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware did not return any nodes. Number of required Nodes 0, Datasize Required: 998244352
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:87)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:243)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:111)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:819)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:53)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:27)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:127)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:136)
scm_1         |         at java.base/java.lang.Thread.run(Thread.java:829)
scm_1         | 2023-04-19 12:00:09,485 [Under Replicated Processor] INFO replication.UnhealthyReplicationProcessor: Processed 0 containers with health state counts {}, failed processing 1
{code}
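The trace reports "Number of required Nodes 0" for a container that was queued as under-replicated. The following is a minimal sketch of one way such a mismatch could arise. It is offered as an illustration, not as a diagnosis of the Ozone code. It assumes a replication factor of 3 and assumes that the health check ignores UNHEALTHY replicas while the target calculation counts every replica. All class and method names below are hypothetical and do not correspond to Ozone's classes.

```java
import java.util.List;

// Illustrative only: hypothetical names, not Ozone's actual implementation.
public class RequiredNodesSketch {

    public enum ReplicaState { CLOSED, UNHEALTHY }

    // Number of replicas that are not marked UNHEALTHY.
    public static int healthyCount(List<ReplicaState> replicas) {
        return (int) replicas.stream()
                .filter(s -> s != ReplicaState.UNHEALTHY)
                .count();
    }

    // Health-check view (assumption): only healthy replicas count toward the factor.
    public static boolean isUnderReplicated(List<ReplicaState> replicas, int factor) {
        return healthyCount(replicas) < factor;
    }

    // Assumed mismatch: every replica, healthy or not, is counted as present.
    public static int requiredCountingAll(List<ReplicaState> replicas, int factor) {
        return Math.max(0, factor - replicas.size());
    }

    // Consistent alternative: count only the replicas the health check counted.
    public static int requiredCountingHealthy(List<ReplicaState> replicas, int factor) {
        return Math.max(0, factor - healthyCount(replicas));
    }

    // Stand-in for a placement call that refuses to choose zero nodes.
    public static void chooseNodes(int nodesRequired) {
        if (nodesRequired <= 0) {
            throw new IllegalArgumentException(
                    "nodesRequired must be positive, got " + nodesRequired);
        }
    }

    public static void main(String[] args) {
        // Two good replicas plus the one whose checksum was changed.
        List<ReplicaState> replicas = List.of(
                ReplicaState.CLOSED, ReplicaState.CLOSED, ReplicaState.UNHEALTHY);
        int factor = 3;

        System.out.println("under-replicated: " + isUnderReplicated(replicas, factor));
        System.out.println("required (all counted): " + requiredCountingAll(replicas, factor));
        System.out.println("required (healthy only): " + requiredCountingHealthy(replicas, factor));

        try {
            chooseNodes(requiredCountingAll(replicas, factor));
        } catch (IllegalArgumentException e) {
            System.out.println("placement rejected: " + e.getMessage());
        }
    }
}
```

Under these assumptions the container is classified as under-replicated, yet zero new nodes are requested, which would be consistent with both the earlier IllegalArgumentException and the newer "did not return any nodes" message. Whether this is the actual cause still needs to be confirmed against the handler code.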

  was:
This is the rack-aware environment defined in 
{{dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone-topology}}. In addition, I 
added the following configurations to enable the new ReplicationManager and 
ContainerScanner. The ContainerBalancer configurations shouldn't be relevant 
here.
{code:java}
OZONE-SITE.XML_hdds.scm.replication.enable.legacy=false
OZONE-SITE.XML_hdds.container.balancer.balancing.iteration.interval=5m
OZONE-SITE.XML_hdds.container.balancer.move.timeout=295s
OZONE-SITE.XML_hdds.container.balancer.move.replication.timeout=200s
OZONE-SITE.XML_hdds.scm.replication.thread.interval=100s
OZONE-SITE.XML_hdds.container.scrub.enabled=true
OZONE-SITE.XML_hdds.container.scrub.metadata.scan.interval=20s
OZONE-SITE.XML_hdds.container.scrub.data.scan.interval=20s
{code}
When I manually change the checksum of a container replica on a DN, the 
container scanner detects this and marks the replica UNHEALTHY. However, RM 
is not able to handle the resulting under-replicated container:
{code:java}
scm_1         | 2023-04-19 07:49:20 ERROR UnhealthyReplicationProcessor:98 - Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#10, state=CLOSED, pipelineID=PipelineID=4c45f6ae-ff08-4890-8f8a-9cb12cd16283, stateEnterTime=2023-04-19T07:31:55.110Z, owner=om1}
scm_1         | java.lang.IllegalArgumentException
scm_1         |         at com.google.common.base.Preconditions.checkArgument(Preconditions.java:131)
scm_1         |         at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodesInternal(SCMContainerPlacementRackAware.java:107)
scm_1         |         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
scm_1         |         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:238)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:110)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:819)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:53)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:27)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:127)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1         |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:136)
scm_1         |         at java.base/java.lang.Thread.run(Thread.java:829)
scm_1         | 2023-04-19 07:49:20 INFO  UnhealthyReplicationProcessor:110 - Processed 0 containers with health state counts {}, failed processing 1
{code}


> Ratis under replication handling in a rack aware environment doesn't work
> -------------------------------------------------------------------------
>
>                 Key: HDDS-8459
>                 URL: https://issues.apache.org/jira/browse/HDDS-8459
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Siddhant Sangwan
>            Assignee: Siddhant Sangwan
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

