Christos Bisias created HDDS-9645:
-------------------------------------
Summary: SCM and Recon are inconsistent in excluding
out-of-service nodes when checking for healthy containers
Key: HDDS-9645
URL: https://issues.apache.org/jira/browse/HDDS-9645
Project: Apache Ozone
Issue Type: Bug
Reporter: Christos Bisias
Assignee: Christos Bisias
Attachments: image-2023-11-07-17-47-14-250.png
When SCM checks for over-replication or under-replication, it doesn’t count
replicas that belong to datanodes that are decommissioned or in maintenance.
However, it does include these datanodes when checking for mis-replication.
Recon counts replicas belonging to datanodes that are decommissioned or in
maintenance in all of the above cases.
We should exclude these datanodes
* to be consistent
* because replicas belonging to out-of-service nodes are not actually available
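The proposed behaviour can be illustrated with a minimal sketch. It is not SCM or Recon code: the class, the {{OpState}} and {{Health}} enums, and the {{Replica}} type are hypothetical stand-ins introduced only for this example, and it covers only the replica-count checks (over- and under-replication), not mis-replication, which depends on placement.

```java
import java.util.List;

public class ReplicaCountSketch {
    // Hypothetical stand-in for a datanode's operational state; not Ozone's actual type.
    enum OpState { IN_SERVICE, DECOMMISSIONED, IN_MAINTENANCE }

    // Hypothetical health classification based on replica count only.
    enum Health { UNDER_REPLICATED, HEALTHY, OVER_REPLICATED }

    // Hypothetical replica: the datanode holding it and that datanode's state.
    static final class Replica {
        final String datanode;
        final OpState state;

        Replica(String datanode, OpState state) {
            this.datanode = datanode;
            this.state = state;
        }
    }

    // Count only replicas on in-service datanodes; replicas on decommissioned
    // or in-maintenance datanodes are treated as unavailable.
    static long countInService(List<Replica> replicas) {
        return replicas.stream()
            .filter(r -> r.state == OpState.IN_SERVICE)
            .count();
    }

    // Compare the number of available replicas with the expected replication factor.
    static Health classify(long available, int expected) {
        if (available < expected) {
            return Health.UNDER_REPLICATED;
        }
        if (available > expected) {
            return Health.OVER_REPLICATED;
        }
        return Health.HEALTHY;
    }

    public static void main(String[] args) {
        // Five replicas in total, two of them on decommissioned datanodes.
        List<Replica> replicas = List.of(
            new Replica("dn1", OpState.DECOMMISSIONED),
            new Replica("dn2", OpState.DECOMMISSIONED),
            new Replica("dn3", OpState.IN_SERVICE),
            new Replica("dn4", OpState.IN_SERVICE),
            new Replica("dn5", OpState.IN_SERVICE));
        int expected = 3;

        // Counting every replica, regardless of datanode state.
        System.out.println("all replicas: " + classify(replicas.size(), expected));
        // Counting only replicas on in-service datanodes.
        System.out.println("in-service only: " + classify(countInService(replicas), expected));
    }
}
```

With an expected replication factor of 3, counting all five replicas classifies the container as OVER_REPLICATED, while counting only the three in-service replicas classifies it as HEALTHY.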
To reproduce the issue
* Go to /hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone
* Edit *docker-config* and add these two configs, which are needed to decommission datanodes
**
{code:java}
OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
{code}
* Start the Docker environment and create a key with RATIS replication, factor THREE
**
{code:java}
> docker-compose up --scale datanode=3 -d
> docker-compose exec scm bash
bash-4.2$ ozone sh volume create /vol1
bash-4.2$ ozone sh bucket create /vol1/bucket1
bash-4.2$ ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE
{code}
* Decommission two of the three datanodes that hold the container replicas
**
{code:java}
bash-4.2$ ozone admin container info 1
# note two of the three datanodes that hold the replicas
bash-4.2$ ozone admin scm roles
# copy the SCM IP
bash-4.2$ ozone admin datanode list
# copy the datanode IPs
bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.8/ozone-datanode-2.ozone_default
Started decommissioning datanode(s):
172.23.0.8/ozone-datanode-2.ozone_default
bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.11/ozone-datanode-1.ozone_default
Started decommissioning datanode(s):
172.23.0.11/ozone-datanode-1.ozone_default
{code}
* After the nodes have been successfully decommissioned
** SCM container report
***
{code:java}
bash-4.2$ ozone admin container report
Container Summary Report generated at 2023-11-07T15:37:38Z
==========================================================
Container State Summary
=======================
OPEN: 0
CLOSING: 0
QUASI_CLOSED: 0
CLOSED: 1
DELETING: 0
DELETED: 0
RECOVERING: 0
Container Health Summary
========================
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 0
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
{code}
** Recon container page
*** The container appears as over-replicated indefinitely
*** Two replicas have been created on new datanodes, but Recon reports that 3
replicas are expected and 5 exist, because it also counts the replicas on the
out-of-service nodes
!image-2023-11-07-17-47-14-250.png|width=387,height=238!