[ 
https://issues.apache.org/jira/browse/HDDS-11389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877739#comment-17877739
 ] 

Arafat Khan commented on HDDS-11389:
------------------------------------

The root cause of this issue lies in how Recon handles container state 
transitions from *CLOSED → DELETING → DELETED*.
 * In Recon, the *{{ContainerHealthTask}}* is responsible for identifying which 
containers are *DELETED* and facilitating their removal from the list 
maintained by the *Recon Container Manager*.
 * The *{{containerDeletedInSCM}}* method within the *{{ContainerHealthTask}}* 
makes an RPC call to the SCM to fetch information about containers that were 
previously in an unhealthy state. These unhealthy states include 
*{{MISSING}}*, *{{MIS_REPLICATED}}*, *{{UNDER_REPLICATED}}*, and 
*{{OVER_REPLICATED}}*.
 * For containers in these specific states, we check with SCM to see if they 
have been marked as *DELETED* and, if so, update the status on the Recon side 
accordingly.
 * The problem arises with containers in the *{{EMPTY_MISSING}}* state, a state 
that was recently introduced. These containers are considered "missing" because 
they have no reported replicas, but are also deemed "*empty*" because they have 
no keys mapped to them.
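
As a rough illustration of the distinction between the two states, the sketch below classifies a container from its replica count and key count. The class, enum, and parameter names are hypothetical and are not Recon's actual API; the state is spelled {{EMPTY_MISSING}} here, matching the name used in the proposed fix.

```java
public class MissingStateSketch {

  // Hypothetical subset of states, for illustration only.
  public enum State { HEALTHY, MISSING, EMPTY_MISSING }

  // replicaCount: replicas reported by datanodes (assumed input).
  // keyCount: keys mapped to the container (assumed input).
  public static State classify(int replicaCount, long keyCount) {
    if (replicaCount > 0) {
      // Other unhealthy states (under/over/mis-replicated) are ignored here.
      return State.HEALTHY;
    }
    // No replicas reported: "missing". No keys either: also "empty".
    return keyCount == 0 ? State.EMPTY_MISSING : State.MISSING;
  }

  public static void main(String[] args) {
    System.out.println(classify(0, 12)); // no replicas, has keys
    System.out.println(classify(0, 0));  // no replicas, no keys
    System.out.println(classify(3, 0));  // replicas present
  }
}
```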

The *{{retainOrUpdateRecord}}* method is called before every SCM RPC call to 
check whether the record's state is one of the handled states 
(*{{MISSING}}*, *{{MIS_REPLICATED}}*, *{{UNDER_REPLICATED}}*, or 
*{{OVER_REPLICATED}}*). If it is, and the container still matches that state, 
the method returns {{true}}; otherwise it returns {{false}}. Consequently, in 
the case of *{{EMPTY_MISSING}}*, the method always returns {{false}} 
because this state is not included in the checks.
 * As a result, even though these containers may have been deleted by the SCM, 
we never initiate the call to check, since the *{{retainOrUpdateRecord}}* 
method always returns {{false}}: the switch has no case for 
*{{EMPTY_MISSING}}*.

{code:java}
public static boolean retainOrUpdateRecord(
    ContainerHealthStatus container, UnhealthyContainersRecord rec) {
  boolean returnValue = false;
  switch (UnHealthyContainerStates.valueOf(rec.getContainerState())) {
    case MISSING:
      returnValue = container.isMissing() && !container.isEmpty();
      break;
    case MIS_REPLICATED:
      returnValue = keepMisReplicatedRecord(container, rec);
      break;
    case UNDER_REPLICATED:
      returnValue = keepUnderReplicatedRecord(container, rec);
      break;
    case OVER_REPLICATED:
      returnValue = keepOverReplicatedRecord(container, rec);
      break;
    default:
      returnValue = false;
  }
  return returnValue;
}
{code}
The code calls *retainOrUpdateRecord* before calling *containerDeletedInSCM*:
[https://github.com/apache/ozone/blob/f22c6f8dfcc3e2ac822189e207d4cc85fc6fc490/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java#L286]
{code:java}
if (ContainerHealthRecords.retainOrUpdateRecord(currentContainer, rec)) {
  // Check if the missing container is deleted in SCM
  if (currentContainer.isMissing() &&
      containerDeletedInSCM(currentContainer.getContainer())) {
    rec.delete();
  }
  existingRecords.add(rec.getContainerState());
  if (rec.changed()) {
    rec.update();
  }
} else {
  LOG.info("Deleted existing unhealthy container record for Container: {}",
      currentContainer.getContainerID());
  rec.delete();
}
{code}
To fix this, we need to add an *EMPTY_MISSING* case to the switch in 
*retainOrUpdateRecord* so that these containers can also be deleted.
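
A minimal sketch of what that change could look like, using simplified stand-ins for the Recon types. The class {{RetainRecordSketch}}, its nested {{Container}} and {{State}} types, and the condition chosen for the new case ({{isMissing() && isEmpty()}}) are assumptions for illustration, not the actual patch.

```java
public class RetainRecordSketch {

  // Hypothetical stand-in for the record states relevant to this issue.
  public enum State { MISSING, EMPTY_MISSING }

  // Hypothetical stand-in for ContainerHealthStatus: only the two flags used here.
  public static final class Container {
    private final boolean missing;
    private final boolean empty;

    public Container(boolean missing, boolean empty) {
      this.missing = missing;
      this.empty = empty;
    }

    public boolean isMissing() { return missing; }
    public boolean isEmpty() { return empty; }
  }

  // Mirrors the current behavior: EMPTY_MISSING falls through to default.
  public static boolean retainBefore(Container c, State recordState) {
    switch (recordState) {
      case MISSING:
        return c.isMissing() && !c.isEmpty();
      default:
        return false;
    }
  }

  // Sketch of the proposed change: an explicit EMPTY_MISSING case.
  public static boolean retainAfter(Container c, State recordState) {
    switch (recordState) {
      case MISSING:
        return c.isMissing() && !c.isEmpty();
      case EMPTY_MISSING:
        // Assumed condition: keep the record while the container is
        // still both missing and empty.
        return c.isMissing() && c.isEmpty();
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    Container emptyMissing = new Container(true, true);
    System.out.println(retainBefore(emptyMissing, State.EMPTY_MISSING));
    System.out.println(retainAfter(emptyMissing, State.EMPTY_MISSING));
  }
}
```

With the extra case, a record for a container that is both missing and empty returns {{true}}, so the caller reaches the *containerDeletedInSCM* check instead of falling into the {{else}} branch.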

> Incorrect number of deleted containers shown in Recon UI
> --------------------------------------------------------
>
>                 Key: HDDS-11389
>                 URL: https://issues.apache.org/jira/browse/HDDS-11389
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Recon
>    Affects Versions: 1.4.0
>            Reporter: Arafat Khan
>            Assignee: Arafat Khan
>            Priority: Major
>             Fix For: 1.5.0
>
>
> *Log:*
> {code:java}
> [root~]# ozone admin container list -c=100000 --state=DELETED | grep -c 
> 'containerID'
> 8
> [root~]# curl -s --negotiate --cacert 
> /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem -u : 
> "https://ccycloud-5.quasar-hbzjtf.root.comops.site:9889/api/v1/clusterState";
> {"deletedDirs":0,"pipelines":18,"totalDatanodes":8,"healthyDatanodes":8,"storageReport":{"capacity":5390670946304,"used":15558773207,"remaining":4279838674944},"containers":24,"missingContainers":0,"openContainers":16,"deletedContainers":0,"volumes":2,"buckets":2,"keys":1,"keysPendingDeletion":0}
>  {code}



