[ 
https://issues.apache.org/jira/browse/HDDS-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700087#comment-17700087
 ] 

Mladjan Gadzic commented on HDDS-7098:
--------------------------------------

[~erose] thanks for a quick response and thorough info!
{quote}You should be able to use the [docker compose definition from the 
upgrade acceptance 
tests|https://github.com/apache/ozone/blob/e84aa4c4ea7e3d094630bb285afd2f4b38232426/hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/docker-compose.yaml]
 to persist information through restarts.
{quote}
I did that by manually modifying the docker-compose file.
{quote}hdds.container.checksum.verification.enabled=false
{quote}
This was the missing piece of the puzzle. After configuring Ozone to skip 
checksum comparisons like:
{noformat}
OZONE-SITE.XML_hdds.container.checksum.verification.enabled=false
{noformat}
I was able to start up the DN (after closing the container, modifying the 
.container file and shutting down the DN) and get the replica to show up as 
UNHEALTHY in the closed container.
{code:bash}
bash-4.2$ ozone admin container info 1
Container id: 1
Pipeline id: 890b795c-7c16-4efb-9522-77844503d378
Container State: CLOSED
Datanodes: 
[d93b61e9-05ce-4051-9903-8e3b5d88a118/ozone-ha-datanode3-1.ozone-ha_default,
ea75905f-fedb-4997-a809-c6400d6d8be4/ozone-ha-datanode2-1.ozone-ha_default,
6cde010a-af01-4d5b-be3e-89c1941d27d8/ozone-ha-datanode1-1.ozone-ha_default]
Replicas: [State: UNHEALTHY; ReplicaIndex: 0; Origin: 
d93b61e9-05ce-4051-9903-8e3b5d88a118; Location: 
d93b61e9-05ce-4051-9903-8e3b5d88a118/ozone-ha-datanode3-1.ozone-ha_default,
State: CLOSED; ReplicaIndex: 0; Origin: ea75905f-fedb-4997-a809-c6400d6d8be4; 
Location: 
ea75905f-fedb-4997-a809-c6400d6d8be4/ozone-ha-datanode2-1.ozone-ha_default,
State: CLOSED; ReplicaIndex: 0; Origin: 6cde010a-af01-4d5b-be3e-89c1941d27d8; 
Location: 
6cde010a-af01-4d5b-be3e-89c1941d27d8/ozone-ha-datanode1-1.ozone-ha_default]
{code}
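Since {{ozone admin container info}} is currently the only place where the replica state is visible, its output can be scraped as a stopgap. The following is only a rough sketch: the regular expression is inferred from the sample output above (the format may differ between Ozone versions), and the names {{Replica}} and {{parse_replicas}} are made up for illustration.

```python
import re
from typing import List, NamedTuple


class Replica(NamedTuple):
    state: str
    index: int
    origin: str
    datanode_uuid: str
    datanode_host: str


# Pattern inferred from the sample output above (an assumption, not a stable
# format). "\s*" also matches newlines, so wrapped output is tolerated.
# The "Container State: CLOSED" line is skipped because it has no ";" after
# the state name.
REPLICA_RE = re.compile(
    r"State:\s*(\w+);\s*ReplicaIndex:\s*(\d+);\s*Origin:\s*([0-9a-f-]+);"
    r"\s*Location:\s*([0-9a-f-]+)/([^,\]\s]+)"
)


def parse_replicas(info_output: str) -> List[Replica]:
    """Extract one Replica per 'State: ...; Location: ...' entry."""
    return [
        Replica(m.group(1), int(m.group(2)), m.group(3), m.group(4), m.group(5))
        for m in REPLICA_RE.finditer(info_output)
    ]


def unhealthy(replicas: List[Replica]) -> List[Replica]:
    """Keep only the replicas reported in the UNHEALTHY state."""
    return [r for r in replicas if r.state == "UNHEALTHY"]
```

Feeding it the captured text of the command above and calling {{unhealthy(parse_replicas(text))}} would be expected to return the replica on datanode3. This still requires knowing which container ID to inspect, which is the limitation this jira is about.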
{code:bash}
bash-4.2$ ozone admin container report
Container Summary Report generated at 2023-03-14T10:10:40Z
==========================================================

Container State Summary
=======================
OPEN: 2
CLOSING: 0
QUASI_CLOSED: 0
CLOSED: 1
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
UNDER_REPLICATED: 1
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 0
UNHEALTHY: 1
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0

First 100 UNDER_REPLICATED containers:
#1

First 100 UNHEALTHY containers:
#1
{code}
With the changes mentioned above, I got a slightly different response from the 
Recon API /api/v1/containers:
{code:json}
{
    "missingCount": 0,
    "underReplicatedCount": 1,
    "overReplicatedCount": 0,
    "misReplicatedCount": 0,
    "containers": [
        {
            "containerID": 1,
            "containerState": "UNDER_REPLICATED",
            "unhealthySince": 1678788485018,
            "expectedReplicaCount": 3,
            "actualReplicaCount": 2,
            "replicaDeltaCount": 1,
            "reason": null,
            "keys": 334,
            "pipelineID": "b1773636-249b-4af5-9e0e-bcb937b3aafc",
            "replicas": [
                {
                    "containerId": 1,
                    "datanodeUuid": "ea75905f-fedb-4997-a809-c6400d6d8be4",
                    "datanodeHost": "ozone-ha-datanode2-1.ozone-ha_default",
                    "firstSeenTime": 1678788164045,
                    "lastSeenTime": 1678788464919,
                    "lastBcsId": 3602
                },
                {
                    "containerId": 1,
                    "datanodeUuid": "6cde010a-af01-4d5b-be3e-89c1941d27d8",
                    "datanodeHost": "ozone-ha-datanode1-1.ozone-ha_default",
                    "firstSeenTime": 1678788164043,
                    "lastSeenTime": 1678788464912,
                    "lastBcsId": 3602
                },
                {
                    "containerId": 1,
                    "datanodeUuid": "d93b61e9-05ce-4051-9903-8e3b5d88a118",
                    "datanodeHost": "f7818f21f520",
                    "firstSeenTime": 1678788164041,
                    "lastSeenTime": 1678788440500,
                    "lastBcsId": 3602
                }
            ]
        }
    ]
}
{code}
However,
{quote}I don't see anything in the json response that indicates which 
replica(s) are unhealthy.
{quote}
you are right. There is no way to distinguish unhealthy replica(s) other than 
noticing that the "datanodeHost" value is not a "known" hostname, and that is 
not something I'd want to rely on. An interesting idea/solution might be to 
extend the Recon API response with the replica state, something like:
{code:json}
{
    "containerId": 1,
    "datanodeUuid": "d93b61e9-05ce-4051-9903-8e3b5d88a118",
    "datanodeHost": "f7818f21f520",
    "firstSeenTime": 1678788164041,
    "lastSeenTime": 1678788440500,
    "lastBcsId": 3602,
    "state": "UNHEALTHY"
}
{code}
Does this make sense?
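
To illustrate how a client could use this, here is a small sketch that filters a response of the shape shown earlier by the proposed "state" field. Note that "state" is only the proposal from this comment and is not part of the current Recon response; the function name {{unhealthy_replicas}} is hypothetical as well.

```python
from typing import Any, Dict, FrozenSet, List


def unhealthy_replicas(
    response: Dict[str, Any],
    bad_states: FrozenSet[str] = frozenset({"UNHEALTHY"}),
) -> List[Dict[str, Any]]:
    """Collect replica objects whose proposed "state" field is in bad_states."""
    found = []
    for container in response.get("containers", []):
        for replica in container.get("replicas", []):
            # "state" is the field proposed in this comment. Current responses
            # do not carry it, so .get() returns None and nothing matches.
            if replica.get("state") in bad_states:
                found.append(replica)
    return found
```

With the current response format this returns an empty list, so the same client code would keep working before and after such a field is added.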

> Provide a way for admin to identify all unhealthy container replicas
> --------------------------------------------------------------------
>
>                 Key: HDDS-7098
>                 URL: https://issues.apache.org/jira/browse/HDDS-7098
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>         Attachments: MissingContainers.png, image-2023-03-02-16-01-07-814.png
>
>
> Currently UNHEALTHY is a state that a container replica can be in 
> (ContainerReplicaProto#State), but not a state that the container can be in 
> overall (LifeCycleState). This means {{ozone admin container list}} has no 
> info about unhealthy containers, because it currently does not print replica 
> information. [Recon's 
> API|https://ozone.apache.org/docs/current/interface/reconapi.html] and UI 
> do not expose replica information either. The only way to determine 
> unhealthy containers is to run {{ozone admin container info <ID>}} for a 
> container that is already suspected to have unhealthy replicas. This jira 
> aims to provide a way to identify and filter container replica states, 
> through either Recon's UI, Recon's REST API, or client CLI.



