[
https://issues.apache.org/jira/browse/HDDS-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699792#comment-17699792
]
Mladjan Gadzic commented on HDDS-7098:
--------------------------------------
[~erose] with Neil's help I managed to run ozone-ha and tinker around with
.container files. The steps I took:
* change docker-compose.yaml in order to preserve DN data after restart
* exercise ozone freon rk to generate keys
* close container (it has 3 replicas at this point)
* shut down DN
* change .container file for the shut-down DN
* start DN
* container has 2 replicas at this point
* Recon API response for unhealthy containers was as follows:
{code:java}
{
  "missingCount": 0,
  "underReplicatedCount": 1,
  "overReplicatedCount": 0,
  "misReplicatedCount": 0,
  "containers": [
    {
      "containerID": 1,
      "containerState": "UNDER_REPLICATED",
      "unhealthySince": 1678440084452,
      "expectedReplicaCount": 3,
      "actualReplicaCount": 2,
      "replicaDeltaCount": 1,
      "reason": null,
      "keys": 334,
      "pipelineID": "fdbfccde-6089-425f-a8b9-d6fe9c91ad27",
      "replicas": [
        {
          "containerId": 1,
          "datanodeUuid": "ad68bb1a-dcc1-4428-a0ae-e3d478ff5d6a",
          "datanodeHost": "ozone-ha-datanode3-1.ozone-ha_default",
          "firstSeenTime": 1678439458475,
          "lastSeenTime": 1678440059359,
          "lastBcsId": 3639
        },
        {
          "containerId": 1,
          "datanodeUuid": "8e501294-86ec-4d4f-8ef0-a88fd06a0e89",
          "datanodeHost": "ozone-ha-datanode2-1.ozone-ha_default",
          "firstSeenTime": 1678439458476,
          "lastSeenTime": 1678439999398,
          "lastBcsId": 3639
        },
        {
          "containerId": 1,
          "datanodeUuid": "0580e4c2-66be-4a7e-ba67-b5f66757123f",
          "datanodeHost": "57d1298da886",
          "firstSeenTime": 1678439458437,
          "lastSeenTime": 1678439819434,
          "lastBcsId": 0
        }
      ]
    }
  ]
}
{code}
The unhealthy replica is the one with lastBcsId=0. It looks like the Recon API
provides a way to point out unhealthy containers and their replicas.
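
To illustrate how a response like the one above could be checked
automatically, here is a minimal Python sketch. It assumes the JSON has already
been fetched and parsed; the function names are hypothetical, and treating a
replica whose lastBcsId is lower than the highest value among its siblings as
suspect is a heuristic based on this one observation, not documented Recon
behavior. The sample data is trimmed from the response above.

```python
from datetime import datetime, timezone


def lagging_replicas(response):
    """Return (containerID, datanodeUuid, lastBcsId) for every replica whose
    lastBcsId is below the highest lastBcsId reported for its container.

    Heuristic only: a lagging BCSID is taken as a hint that the replica
    is stale or damaged (assumption, see the paragraph above).
    """
    flagged = []
    for container in response.get("containers", []):
        replicas = container.get("replicas", [])
        if not replicas:
            continue
        highest = max(r["lastBcsId"] for r in replicas)
        for r in replicas:
            if r["lastBcsId"] < highest:
                flagged.append(
                    (container["containerID"], r["datanodeUuid"], r["lastBcsId"])
                )
    return flagged


def to_utc(epoch_ms):
    """Convert an epoch-milliseconds field (e.g. lastSeenTime) to a UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Sample trimmed from the response above (only the fields the sketch reads).
sample = {
    "containers": [
        {
            "containerID": 1,
            "unhealthySince": 1678440084452,
            "replicas": [
                {"datanodeUuid": "ad68bb1a-dcc1-4428-a0ae-e3d478ff5d6a",
                 "lastSeenTime": 1678440059359, "lastBcsId": 3639},
                {"datanodeUuid": "8e501294-86ec-4d4f-8ef0-a88fd06a0e89",
                 "lastSeenTime": 1678439999398, "lastBcsId": 3639},
                {"datanodeUuid": "0580e4c2-66be-4a7e-ba67-b5f66757123f",
                 "lastSeenTime": 1678439819434, "lastBcsId": 0},
            ],
        }
    ]
}

if __name__ == "__main__":
    for container_id, uuid, bcs_id in lagging_replicas(sample):
        print(f"container {container_id}: replica on {uuid} lags (lastBcsId={bcs_id})")
    print("unhealthy since", to_utc(sample["containers"][0]["unhealthySince"]))
```

On the sample, only the replica on datanode 0580e4c2-... is flagged, which
matches the replica whose .container file was edited.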
Some things I noticed along the way:
* the .container file could not be parsed because its checksum differed from
the expected one
* replication of the container was attempted but failed with an exception:
{code:java}
2023-03-09 17:20:55,247 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Starting replication of container 1 from [48f641fe-7d92-437b-9ff3-2d47359552f8(ozone-ha-datanode2-1.ozone-ha_default/192.168.96.12), 07caeb3d-4dc5-4e18-83ee-cf08d4b50297(ozone-ha-datanode3-1.ozone-ha_default/192.168.96.9)] using NO_COMPRESSION
2023-03-09 17:20:56,173 [grpc-default-executor-0] INFO replication.GrpcReplicationClient: Container 1 is downloaded to /data/hdds/tmp/container-copy/container-1.tar
2023-03-09 17:20:56,181 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Container 1 is downloaded with size 3606016, starting to import.
2023-03-09 17:20:58,163 [ContainerReplicationThread-0] ERROR replication.DownloadAndImportReplicator: Container 1 replication was unsuccessful.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Container 1 unpack failed because ContainerFile /data/hdds/hdds/CID-8922c93b-c9ae-4f14-99b0-644432ce5dde/current/containerDir0/1 already exists
    at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.unpackContainerData(TarContainerPacker.java:109)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.importContainerData(KeyValueContainer.java:526)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.importContainer(KeyValueHandler.java:1032)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.importContainer(ContainerController.java:161)
    at org.apache.hadoop.ozone.container.replication.ContainerImporter.importContainer(ContainerImporter.java:101)
    at org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.replicate(DownloadAndImportReplicator.java:92)
    at org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(MeasuredReplicator.java:83)
    at org.apache.hadoop.ozone.container.replication.ReplicationTask.runTask(ReplicationTask.java:112)
    at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:212)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2023-03-09 17:20:58,164 [ContainerReplicationThread-0] ERROR replication.ReplicationSupervisor: Failed ReplicationTask{status=FAILED, cmd={replicateContainerCommand: containerId: 1, replicaIndex: 0, sourceNodes: [48f641fe-7d92-437b-9ff3-2d47359552f8(ozone-ha-datanode2-1.ozone-ha_default/192.168.96.12), 07caeb3d-4dc5-4e18-83ee-cf08d4b50297(ozone-ha-datanode3-1.ozone-ha_default/192.168.96.9)], priority: NORMAL}, queued=2023-03-09T17:20:55.232214Z}
{code}
* the scanner scanned the necessary files, but no repair took place:
{code:java}
2023-03-13 15:25:05,077 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
2023-03-13 15:25:05,083 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
2023-03-13 15:25:05,108 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/metadata/ratis
2023-03-13 15:25:05,108 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/metadata/ratis
{code}
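
The checksum failure in the first observation above is what one would expect
from any metadata file that stores a digest of its own contents. The sketch
below is a generic illustration only: the field names, the serialization, and
the use of SHA-256 are assumptions made for the example, not a description of
how Ozone actually computes the .container checksum.

```python
import hashlib


def compute_checksum(fields):
    """Digest every field except the stored checksum, in sorted key order.

    Hypothetical scheme for illustration; not Ozone's real algorithm.
    """
    body = "\n".join(
        f"{key}: {fields[key]}" for key in sorted(fields) if key != "checksum"
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def verify(fields):
    """True when the stored checksum matches the recomputed one."""
    return fields.get("checksum") == compute_checksum(fields)


# Hypothetical metadata, written once with a matching checksum.
meta = {"containerID": 1, "state": "CLOSED"}
meta["checksum"] = compute_checksum(meta)

# A hand edit that does not also update the checksum.
edited = dict(meta, state="OPEN")

if __name__ == "__main__":
    print("original verifies:", verify(meta))
    print("edited verifies:  ", verify(edited))
```

Under such a scheme, any manual change to the file has to be accompanied by a
recomputed checksum, otherwise the reader rejects the file as happened here.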
I've checked Ozone Datanode Scanners V2.pdf, attached to the ticket you
mentioned, but I could not find anything describing how to run the scanner on
demand. Is that even possible?
> Provide a way for admin to identify all unhealthy container replicas
> --------------------------------------------------------------------
>
> Key: HDDS-7098
> URL: https://issues.apache.org/jira/browse/HDDS-7098
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ethan Rose
> Assignee: Devesh Kumar Singh
> Priority: Major
> Attachments: MissingContainers.png, image-2023-03-02-16-01-07-814.png
>
>
> Currently UNHEALTHY is a state that a container replica can be in
> (ContainerReplicaProto#State), but not a state that the container can be in
> overall (LifeCycleState). This means {{ozone admin container list}} has no
> info about unhealthy containers, because it currently does not print replica
> information. [Recon's
> API|https://ozone.apache.org/docs/current/interface/reconapi.html] and UI
> does not expose replica information either. The only way to determine
> unhealthy containers is to run {{ozone admin container info <ID>}} for a
> container that is already suspected to have unhealthy replicas. This jira
> aims to provide a way to identify and filter container replica states,
> through either Recon's UI, Recon's REST API, or client CLI.