[ 
https://issues.apache.org/jira/browse/HDDS-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christos Bisias updated HDDS-9645:
----------------------------------
    Description: 
When SCM checks for over-replication or under-replication, it does not count 
replicas that belong to datanodes that are decommissioned or in maintenance, 
but it does include these datanodes when checking for mis-replication.

Recon, on the other hand, counts replicas belonging to decommissioned or 
in-maintenance datanodes in all of the above cases.

We should exclude these datanodes
 * to be consistent with SCM's behavior

 * because replicas belonging to out-of-service nodes are not actually available
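
The exclusion rule proposed above can be sketched as follows. This is a minimal, self-contained illustration; the class, enum, and method names here are hypothetical and do not match the actual Ozone/Recon classes. The idea is that only replicas hosted on IN_SERVICE datanodes count toward the totals that Recon compares against the expected replication factor.

{code:java}
// Hypothetical sketch: count only replicas whose datanode is IN_SERVICE.
import java.util.List;

public class ReplicaCountSketch {
    enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

    record Replica(String datanode, OpState state) {}

    // Replicas on out-of-service nodes are not actually available,
    // so they should not count toward the replica total.
    static long availableReplicas(List<Replica> replicas) {
        return replicas.stream()
                .filter(r -> r.state() == OpState.IN_SERVICE)
                .count();
    }

    public static void main(String[] args) {
        // The scenario from this issue: 3 original replicas, 2 of their
        // nodes decommissioned, 2 new replicas created on fresh nodes,
        // so 5 raw replicas but only 3 available.
        List<Replica> replicas = List.of(
                new Replica("dn1", OpState.IN_SERVICE),
                new Replica("dn2", OpState.DECOMMISSIONED),
                new Replica("dn3", OpState.DECOMMISSIONED),
                new Replica("dn4", OpState.IN_SERVICE),
                new Replica("dn5", OpState.IN_SERVICE));
        System.out.println(availableReplicas(replicas)); // prints 3
    }
}
{code}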

 

To reproduce the issue:
 * cd into /hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone
 * Edit *docker-config* and add these two configs, which are required for decommissioning datanodes
 ** 
{code}
OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
{code}
 

 * Start the docker environment and create a key with RATIS THREE replication
 ** 
{code}
> docker-compose up --scale datanode=3 -d
> docker-compose exec scm bash
bash-4.2$ ozone sh volume create /vol1
bash-4.2$ ozone sh bucket create /vol1/bucket1
bash-4.2$ ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE 
{code}

 * Decommission 2/3 datanodes that have the container replicas
 ** 
{code}
bash-4.2$ ozone admin container info 1
     # note the 2/3 datanodes that hold the container's replicas
bash-4.2$ ozone admin scm roles
     # copy the SCM IP
bash-4.2$ ozone admin datanode list
     # copy the datanode IPs
bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.8/ozone-datanode-2.ozone_default
Started decommissioning datanode(s):
172.23.0.8/ozone-datanode-2.ozone_default
bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.11/ozone-datanode-1.ozone_default
Started decommissioning datanode(s):
172.23.0.11/ozone-datanode-1.ozone_default
{code}

 * After the nodes have successfully been decommissioned
 ** SCM container report
 *** 
{code}
bash-4.2$ ozone admin container report
Container Summary Report generated at 2023-11-07T15:37:38Z
==========================================================


Container State Summary
=======================
OPEN: 0
CLOSING: 0
QUASI_CLOSED: 0
CLOSED: 1
DELETING: 0
DELETED: 0
RECOVERING: 0


Container Health Summary
========================
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 0
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
{code}

 ** Recon container page
 *** The container appears as over-replicated indefinitely
 *** Two replicas have been created on new datanodes, but Recon reports that it 
expects 3 replicas and actually has 5, because it is also counting the replicas 
on the out-of-service nodes

!image-2023-11-07-17-47-14-250.png|width=387,height=238!
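
The over-replication verdict flips once out-of-service replicas are excluded. A minimal illustration, with the numbers taken from the scenario above and a simplified, hypothetical predicate:

{code:java}
// With 3 replicas expected, Recon's raw count of 5 (including 2 replicas
// on decommissioned nodes) looks over-replicated, while the 3 in-service
// replicas that SCM counts are exactly at the expected count.
public class OverReplicationCheck {
    // Simplified over-replication predicate (for illustration only).
    static boolean isOverReplicated(int replicaCount, int expected) {
        return replicaCount > expected;
    }

    public static void main(String[] args) {
        int expected = 3;
        int rawReplicas = 5;       // 3 in-service + 2 on decommissioned nodes
        int inServiceReplicas = 3; // what SCM counts
        System.out.println(isOverReplicated(rawReplicas, expected));       // true  (Recon's verdict)
        System.out.println(isOverReplicated(inServiceReplicas, expected)); // false (SCM's verdict)
    }
}
{code}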


> Recon doesn't exclude out-of-service nodes when checking for healthy 
> containers
> -------------------------------------------------------------------------------
>
>                 Key: HDDS-9645
>                 URL: https://issues.apache.org/jira/browse/HDDS-9645
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Recon
>            Reporter: Christos Bisias
>            Assignee: Christos Bisias
>            Priority: Major
>         Attachments: image-2023-11-07-17-47-14-250.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
