xBis7 opened a new pull request, #5726: URL: https://github.com/apache/ozone/pull/5726
## What changes were proposed in this pull request?

While a node is decommissioning, new replicas of its containers are copied to other nodes; once this process has finished, the node becomes decommissioned. After the copies are made, the container appears as mis-replicated because of the excess replicas. These replicas are effectively unavailable, since the decommissioned node is expected to be stopped. For that reason, containers that belong to decommissioning or decommissioned nodes shouldn't be counted as mis-replicated.

Nodes in maintenance won't be filtered, because such nodes are expected to come back and no replica copies are made while they enter that state. Mis-replication on a node in maintenance is the same as mis-replication on a healthy, active node.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9683

## How was this patch tested?

New unit tests are added.

It can also be tested manually with the `ozone-topology` docker env like so

* Edit `docker-config` to enable RackScatter policy (easier to reproduce with RackScatter)
  * ```diff
    - OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
    + #OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
    +
    + OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
    + OZONE-SITE.XML_ozone.scm.pipeline.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
    +
    + # For decommission
    + OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
    + OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
    +
    + # Expedite the container replication checking
    + OZONE-SITE.XML_hdds.scm.replication.thread.interval=15s
    ```
* Edit `network-config`
  * ```diff
    - 10.5.0.6 /rack1
    + 10.5.0.6 /rack2
      10.5.0.7 /rack2
    - 10.5.0.8 /rack2
    + 10.5.0.8 /rack3
    - 10.5.0.9 /rack2
    + 10.5.0.9 /rack3
    ```
* Create a key with replication Ratis THREE
  * `ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE`
* Find the replicas for container 1 and decommission one of the replica nodes
  * ```
    bash-4.2$ ozone admin container info 1
    get 1 datanode

    bash-4.2$ ozone admin scm roles
    copy SCM IP

    bash-4.2$ ozone admin datanode list
    copy datanode IP

    bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.8/ozone-datanode-2.ozone_default
    Started decommissioning datanode(s):
    172.23.0.8/ozone-datanode-2.ozone_default
    ```
* Once the node is decommissioned, check the SCM container report with `ozone admin container report` and check the Recon container page.
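
For reference, the filtering idea described under "What changes were proposed" can be sketched in a few lines of Java. This is only an illustration and not the code from this patch: `OpState`, `Replica`, `isLeavingCluster` and `replicasToCheck` are hypothetical names, and the real SCM classes and call sites may look quite different. The sketch assumes the filtering is applied per replica, based on the operational state of the node holding it.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MisReplicationFilterSketch {

    // Hypothetical, simplified stand-in for a datanode's operational state.
    enum OpState {
        IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, ENTERING_MAINTENANCE, IN_MAINTENANCE
    }

    // Hypothetical replica record: the node holding it, its rack, and the node's state.
    static final class Replica {
        final String node;
        final String rack;
        final OpState state;

        Replica(String node, String rack, OpState state) {
            this.node = node;
            this.rack = rack;
            this.state = state;
        }
    }

    // True for nodes that are leaving the cluster and are expected to be stopped.
    static boolean isLeavingCluster(OpState state) {
        return state == OpState.DECOMMISSIONING || state == OpState.DECOMMISSIONED;
    }

    // Keep only the replicas that should take part in the mis-replication check.
    // Replicas on maintenance nodes are kept, since those nodes are expected to return.
    static List<Replica> replicasToCheck(List<Replica> replicas) {
        return replicas.stream()
            .filter(r -> !isLeavingCluster(r.state))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
            new Replica("dn1", "/rack1", OpState.IN_SERVICE),
            new Replica("dn2", "/rack2", OpState.DECOMMISSIONED),
            new Replica("dn3", "/rack3", OpState.IN_MAINTENANCE),
            new Replica("dn4", "/rack2", OpState.IN_SERVICE));

        List<Replica> checked = replicasToCheck(replicas);
        System.out.println(checked.size() + " of " + replicas.size()
            + " replicas take part in the placement check");
    }
}
```

In this sketch the replica on the decommissioned node is dropped before any placement validation runs, while the replica on the maintenance node is treated like one on an in-service node.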
