[
https://issues.apache.org/jira/browse/HDDS-11389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877739#comment-17877739
]
Arafat Khan commented on HDDS-11389:
------------------------------------
The root cause lies in how Recon handles the container state transition
*CLOSED → DELETING → DELETED*.
* In Recon, the *{{ContainerHealthTask}}* is responsible for identifying which
containers are *DELETED* and for removing them from the list
maintained by the *Recon Container Manager*.
* The *{{containerDeletedInSCM}}* method within the *{{ContainerHealthTask}}*
makes an RPC call to the SCM to fetch information about containers that were
previously in an unhealthy state. These unhealthy states are
*{{MISSING}}*, *{{MIS_REPLICATED}}*, *{{UNDER_REPLICATED}}*, and
*{{OVER_REPLICATED}}*.
* For containers in these specific states, we check with SCM to see if they
have been marked as *DELETED* and, if so, update the status on the Recon side
accordingly.
* The problem arises with containers in the *{{EMPTY_MISSING}}* state, a state
that was introduced recently. These containers are considered "missing" because
they have no reported replicas, and "*empty*" because they have
no keys mapped to them.
The *{{retainOrUpdateRecord}}* method is called every time before the
SCM RPC call is made. It switches on the state stored in the existing record
(*{{MISSING}}*, *{{MIS_REPLICATED}}*, *{{UNDER_REPLICATED}}*, or
*{{OVER_REPLICATED}}*) and returns {{true}} when the check for that state
still holds for the container. For any other state it falls through to the
default branch and returns {{false}}. Consequently, for *{{EMPTY_MISSING}}*
the method always returns {{false}}, because this state has no case in the
switch.
* As a result, even though SCM may already have deleted these containers,
we never make the call to check, because *{{retainOrUpdateRecord}}*
always returns {{false}} for records in the *{{EMPTY_MISSING}}* state, which
has no case in the switch.
{code:java}
public static boolean retainOrUpdateRecord(
    ContainerHealthStatus container, UnhealthyContainersRecord rec) {
  boolean returnValue = false;
  switch (UnHealthyContainerStates.valueOf(rec.getContainerState())) {
  case MISSING:
    returnValue = container.isMissing() && !container.isEmpty();
    break;
  case MIS_REPLICATED:
    returnValue = keepMisReplicatedRecord(container, rec);
    break;
  case UNDER_REPLICATED:
    returnValue = keepUnderReplicatedRecord(container, rec);
    break;
  case OVER_REPLICATED:
    returnValue = keepOverReplicatedRecord(container, rec);
    break;
  default:
    returnValue = false;
  }
  return returnValue;
}
{code}
The code calls *{{retainOrUpdateRecord}}* before calling *{{containerDeletedInSCM}}*:
[https://github.com/apache/ozone/blob/f22c6f8dfcc3e2ac822189e207d4cc85fc6fc490/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java#L286]
{code:java}
if (ContainerHealthRecords.retainOrUpdateRecord(currentContainer, rec)) {
  // Check if the missing container is deleted in SCM
  if (currentContainer.isMissing() &&
      containerDeletedInSCM(currentContainer.getContainer())) {
    rec.delete();
  }
  existingRecords.add(rec.getContainerState());
  if (rec.changed()) {
    rec.update();
  }
} else {
  LOG.info("Deleted existing unhealthy container record for Container: {}",
      currentContainer.getContainerID());
  rec.delete();
}
{code}
To fix this, we need to add an *{{EMPTY_MISSING}}* case to the switch in
*{{retainOrUpdateRecord}}*, so that these containers can be deleted as well.
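A minimal sketch of what the added case could look like. This is not the actual patch: {{RetainRecordSketch}}, {{State}}, {{Health}} and {{reachesScmCheck}} are hypothetical, simplified stand-ins for the Recon types, the condition used for *{{EMPTY_MISSING}}* is an assumption (still missing and still empty), and the replication cases are left out for brevity.
{code:java}
/** Simplified, hypothetical stand-ins for the Recon types; not the real Ozone classes. */
class RetainRecordSketch {

  /** Record states, using the names from this comment. */
  enum State { MISSING, EMPTY_MISSING, MIS_REPLICATED, UNDER_REPLICATED, OVER_REPLICATED }

  /** Stand-in for ContainerHealthStatus, reduced to the two flags used here. */
  static final class Health {
    final boolean missing; // no replicas reported
    final boolean empty;   // no keys mapped to the container

    Health(boolean missing, boolean empty) {
      this.missing = missing;
      this.empty = empty;
    }
  }

  /** Replication cases are omitted, so in this sketch they fall to default. */
  static boolean retainOrUpdateRecord(Health container, State recordState) {
    switch (recordState) {
    case MISSING:
      return container.missing && !container.empty;
    case EMPTY_MISSING:
      // Assumed condition for the new case: still missing and still empty.
      return container.missing && container.empty;
    default:
      return false;
    }
  }

  /** True when the caller would go on to ask SCM whether the container is deleted. */
  static boolean reachesScmCheck(Health container, State recordState) {
    return retainOrUpdateRecord(container, recordState) && container.missing;
  }

  public static void main(String[] args) {
    Health emptyMissing = new Health(true, true);
    // With the added case, the SCM check is reached for an EMPTY_MISSING record.
    System.out.println(reachesScmCheck(emptyMissing, State.EMPTY_MISSING)); // true
    // The MISSING case rejects an empty container, as in the existing code.
    System.out.println(reachesScmCheck(emptyMissing, State.MISSING)); // false
  }
}
{code}
With a case like this in place, a record in the *{{EMPTY_MISSING}}* state is retained long enough for the caller to reach *{{containerDeletedInSCM}}*, which is the step that was being skipped.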
> Incorrect number of deleted containers shown in Recon UI
> --------------------------------------------------------
>
> Key: HDDS-11389
> URL: https://issues.apache.org/jira/browse/HDDS-11389
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Recon
> Affects Versions: 1.4.0
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
> Fix For: 1.5.0
>
>
> *Log:*
> {code:java}
> [root~]# ozone admin container list -c=100000 --state=DELETED | grep -c 'containerID'
> 8
> [root~]# curl -s --negotiate --cacert /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem -u : "https://ccycloud-5.quasar-hbzjtf.root.comops.site:9889/api/v1/clusterState"
> {"deletedDirs":0,"pipelines":18,"totalDatanodes":8,"healthyDatanodes":8,"storageReport":{"capacity":5390670946304,"used":15558773207,"remaining":4279838674944},"containers":24,"missingContainers":0,"openContainers":16,"deletedContainers":0,"volumes":2,"buckets":2,"keys":1,"keysPendingDeletion":0}
> {code}