[jira] [Commented] (HDDS-7098) Provide a way for admin to identify all unhealthy container replicas

Mladjan Gadzic (Jira) Mon, 13 Mar 2023 11:59:04 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699792#comment-17699792
 ]


Mladjan Gadzic commented on HDDS-7098:
--------------------------------------

[~erose] with Neil's help I managed to run ozone-ha and tinker around with 
.container files. The steps I took:
 * change docker-compose.yaml in order to preserve DN data after restart
 * run ozone freon rk to generate keys
 * close the container (it has 3 replicas at this point)
 * shut down the DN
 * change the .container file on the shut-down DN
 * start the DN
 * the container has 2 replicas at this point
 * the Recon API response for unhealthy containers was as follows:

{code:java}
{
    "missingCount": 0,
    "underReplicatedCount": 1,
    "overReplicatedCount": 0,
    "misReplicatedCount": 0,
    "containers": [
        {
            "containerID": 1,
            "containerState": "UNDER_REPLICATED",
            "unhealthySince": 1678440084452,
            "expectedReplicaCount": 3,
            "actualReplicaCount": 2,
            "replicaDeltaCount": 1,
            "reason": null,
            "keys": 334,
            "pipelineID": "fdbfccde-6089-425f-a8b9-d6fe9c91ad27",
            "replicas": [
                {
                    "containerId": 1,
                    "datanodeUuid": "ad68bb1a-dcc1-4428-a0ae-e3d478ff5d6a",
                    "datanodeHost": "ozone-ha-datanode3-1.ozone-ha_default",
                    "firstSeenTime": 1678439458475,
                    "lastSeenTime": 1678440059359,
                    "lastBcsId": 3639
                },
                {
                    "containerId": 1,
                    "datanodeUuid": "8e501294-86ec-4d4f-8ef0-a88fd06a0e89",
                    "datanodeHost": "ozone-ha-datanode2-1.ozone-ha_default",
                    "firstSeenTime": 1678439458476,
                    "lastSeenTime": 1678439999398,
                    "lastBcsId": 3639
                },
                {
                    "containerId": 1,
                    "datanodeUuid": "0580e4c2-66be-4a7e-ba67-b5f66757123f",
                    "datanodeHost": "57d1298da886",
                    "firstSeenTime": 1678439458437,
                    "lastSeenTime": 1678439819434,
                    "lastBcsId": 0
                }
            ]
        }
    ]
} {code}
The unhealthy replica is the one with lastBcsId=0. It looks like the Recon API 
provides a way to identify unhealthy containers and their replicas.
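
To make the lastBcsId observation concrete, here is a minimal Python sketch that scans a response shaped like the one above and reports replicas whose lastBcsId is lower than the highest value seen for the same container. The function name find_lagging_replicas and the "lower than peers" rule are my own assumptions for illustration, not something Recon implements; the field names are taken from the response above, and the sample data is a trimmed copy of it.

```python
from typing import Any


def find_lagging_replicas(response: dict[str, Any]) -> list[tuple[int, str, int, int]]:
    """Return (containerID, datanodeUuid, lastBcsId, maxBcsId) for each replica
    whose lastBcsId is below the highest lastBcsId of its container.

    Assumption: a replica that lags behind its peers is the suspect one.
    """
    lagging = []
    for container in response.get("containers", []):
        replicas = container.get("replicas") or []
        if not replicas:
            continue
        max_bcs_id = max(r["lastBcsId"] for r in replicas)
        for r in replicas:
            if r["lastBcsId"] < max_bcs_id:
                lagging.append(
                    (container["containerID"], r["datanodeUuid"],
                     r["lastBcsId"], max_bcs_id)
                )
    return lagging


# Trimmed copy of the Recon response shown above.
SAMPLE = {
    "containers": [
        {
            "containerID": 1,
            "containerState": "UNDER_REPLICATED",
            "replicas": [
                {"datanodeUuid": "ad68bb1a-dcc1-4428-a0ae-e3d478ff5d6a",
                 "datanodeHost": "ozone-ha-datanode3-1.ozone-ha_default",
                 "lastBcsId": 3639},
                {"datanodeUuid": "8e501294-86ec-4d4f-8ef0-a88fd06a0e89",
                 "datanodeHost": "ozone-ha-datanode2-1.ozone-ha_default",
                 "lastBcsId": 3639},
                {"datanodeUuid": "0580e4c2-66be-4a7e-ba67-b5f66757123f",
                 "datanodeHost": "57d1298da886",
                 "lastBcsId": 0},
            ],
        }
    ]
}

if __name__ == "__main__":
    for container_id, uuid, bcs_id, max_bcs_id in find_lagging_replicas(SAMPLE):
        print(f"container {container_id}: replica on {uuid} "
              f"has lastBcsId={bcs_id}, peers are at {max_bcs_id}")
```

This only compares replicas against each other, so it would not flag anything if every replica of a container were equally stale.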

 

Some things I noticed along the way:
 * the .container file could not be parsed because its checksum differed from 
the expected one
 * replication of the container was attempted but failed with an exception: 

{code:java}
2023-03-09 17:20:55,247 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Starting replication of container 1 from [48f641fe-7d92-437b-9ff3-2d47359552f8(ozone-ha-datanode2-1.ozone-ha_default/192.168.96.12), 07caeb3d-4dc5-4e18-83ee-cf08d4b50297(ozone-ha-datanode3-1.ozone-ha_default/192.168.96.9)] using NO_COMPRESSION
2023-03-09 17:20:56,173 [grpc-default-executor-0] INFO replication.GrpcReplicationClient: Container 1 is downloaded to /data/hdds/tmp/container-copy/container-1.tar
2023-03-09 17:20:56,181 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Container 1 is downloaded with size 3606016, starting to import.
2023-03-09 17:20:58,163 [ContainerReplicationThread-0] ERROR replication.DownloadAndImportReplicator: Container 1 replication was unsuccessful.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Container 1 unpack failed because ContainerFile /data/hdds/hdds/CID-8922c93b-c9ae-4f14-99b0-644432ce5dde/current/containerDir0/1 already exists
        at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.unpackContainerData(TarContainerPacker.java:109)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.importContainerData(KeyValueContainer.java:526)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.importContainer(KeyValueHandler.java:1032)
        at org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.importContainer(ContainerController.java:161)
        at org.apache.hadoop.ozone.container.replication.ContainerImporter.importContainer(ContainerImporter.java:101)
        at org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.replicate(DownloadAndImportReplicator.java:92)
        at org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(MeasuredReplicator.java:83)
        at org.apache.hadoop.ozone.container.replication.ReplicationTask.runTask(ReplicationTask.java:112)
        at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:212)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-03-09 17:20:58,164 [ContainerReplicationThread-0] ERROR replication.ReplicationSupervisor: Failed ReplicationTask{status=FAILED, cmd={replicateContainerCommand: containerId: 1, replicaIndex: 0, sourceNodes: [48f641fe-7d92-437b-9ff3-2d47359552f8(ozone-ha-datanode2-1.ozone-ha_default/192.168.96.12), 07caeb3d-4dc5-4e18-83ee-cf08d4b50297(ozone-ha-datanode3-1.ozone-ha_default/192.168.96.9)], priority: NORMAL}, queued=2023-03-09T17:20:55.232214Z} {code}

 * the scanner scanned the necessary files but no repair took place

{code:java}
2023-03-13 15:25:05,077 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
2023-03-13 15:25:05,083 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
2023-03-13 15:25:05,108 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/metadata/ratis
2023-03-13 15:25:05,108 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/metadata/ratis {code}
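
On the checksum mismatch noted in the first bullet: that is the behaviour one would expect if the .container file stores a digest of its own contents, so that any manual edit invalidates it. The Python sketch below illustrates that general idea only. The file layout (a single "checksum:" line), the all-zero placeholder, the use of SHA-256, and the helper names compute_checksum, verify and seal are all assumptions made for illustration and are not taken from the Ozone code.

```python
import hashlib
import re

# Hypothetical layout: a YAML-like text file with exactly one "checksum: <hex>" line.
CHECKSUM_LINE = re.compile(r"^checksum:.*$", re.MULTILINE)
PLACEHOLDER = "checksum: " + "0" * 64


def compute_checksum(text: str) -> str:
    """SHA-256 of the content with the checksum line reset to a fixed placeholder."""
    normalized = CHECKSUM_LINE.sub(PLACEHOLDER, text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def verify(text: str) -> bool:
    """True if the stored checksum matches the one recomputed from the content."""
    match = re.search(r"^checksum:\s*(\S+)\s*$", text, re.MULTILINE)
    return match is not None and match.group(1) == compute_checksum(text)


def seal(text: str) -> str:
    """Write a freshly computed checksum into the checksum line."""
    return CHECKSUM_LINE.sub("checksum: " + compute_checksum(text), text)


if __name__ == "__main__":
    # Field names here are illustrative only.
    original = seal("containerID: 1\nstate: CLOSED\nchecksum: none\n")
    tampered = original.replace("state: CLOSED", "state: OPEN")
    print("original verifies:", verify(original))
    print("tampered verifies:", verify(tampered))
```

Under these assumptions, a hand-edited file fails verification unless the checksum line is recomputed as well.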

 

I've checked the Ozone Datanode Scanners V2.pdf attached to the ticket you 
mentioned, but I could not find anything describing how to run the scanner on 
demand. Is that even possible?

> Provide a way for admin to identify all unhealthy container replicas
> --------------------------------------------------------------------
>
>                 Key: HDDS-7098
>                 URL: https://issues.apache.org/jira/browse/HDDS-7098
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>         Attachments: MissingContainers.png, image-2023-03-02-16-01-07-814.png
>
>
> Currently UNHEALTHY is a state that a container replica can be in 
> (ContainerReplicaProto#State), but not a state that the container can be in 
> overall (LifeCycleState). This means {{ozone admin container list}} has no 
> info about unhealthy containers, because it currently does not print replica 
> information. [Recon's 
> API|https://ozone.apache.org/docs/current/interface/reconapi.html] and UI 
> do not expose replica information either. The only way to determine 
> unhealthy containers is to run {{ozone admin container info <ID>}} for a 
> container that is already suspected to have unhealthy replicas. This jira 
> aims to provide a way to identify and filter container replica states, 
> through either Recon's UI, Recon's REST API, or client CLI.



