[jira] [Commented] (HDDS-7098) Provide a way for admin to identify all unhealthy container replicas

Ethan Rose (Jira) Mon, 13 Mar 2023 15:55:10 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699869#comment-17699869
 ]


Ethan Rose commented on HDDS-7098:
----------------------------------

Thanks for checking this out [~mladjangadzic]. You should be able to use the 
[docker compose definition from the upgrade acceptance 
tests|https://github.com/apache/ozone/blob/e84aa4c4ea7e3d094630bb285afd2f4b38232426/hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/docker-compose.yaml]
 to persist information through restarts.
{quote}Unhealty replica is the one with lastBcsId=0. It looks like Recon API 
provides a way to point out unhealthy containers and its replicas.
{quote}
I don't see anything in the json response that indicates which replica(s) are 
unhealthy.
{quote}.container file could not be parsed because checksum was different than 
one expected
{quote}
This is expected behavior. To get around this, you would need to set 
{{hdds.container.checksum.verification.enabled=false}} for testing purposes. 
Since the container was not loaded, the replica was never marked unhealthy. 
Other configs that might help to speed up testing are:
{code:java}
OZONE-SITE.XML_hdds.scm.replication.thread.interval=5s
OZONE-SITE.XML_hdds.scm.wait.time.after.safemode.exit=10s
{code}
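The checksum setting mentioned above can go in the same place. A minimal sketch, assuming the same {{OZONE-SITE.XML_}} docker-config prefix used for the two lines above, and meant for test clusters only:
{code}
# Test-only assumption: skips the .container file checksum check so the replica can be loaded.
OZONE-SITE.XML_hdds.container.checksum.verification.enabled=false
{code}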
{quote}container was tried to be replicated but without success with an 
exception:
{quote}
Even though the container was not loaded due to the checksum mismatch, the 
files are still on the disk. SCM thinks there is a replica missing since the 
container was not loaded, and tries to replicate it to the only datanode not 
reporting the replica. However, the datanode still has the replica's files in 
the way, so replication fails. The system should be able to work around this if 
there were more nodes in the cluster.
{quote}Scanner scanned necessary files but no repair took place
{quote}
This is the volume scanner, which just checks that the disks for each 
hdds.datanode.dir are still present. The container scanner that would actually 
check the contents of each container is still WIP and off by default, but it 
can at least identify corruption and move containers to the unhealthy state. 
You would need to adjust configs listed in [this 
class|https://github.com/apache/ozone/blob/e84aa4c4ea7e3d094630bb285afd2f4b38232426/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScannerConfiguration.java]
 for it to run.
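A sketch of what that could look like in the same docker-config format as above. The key names assume the {{hdds.container.scrub}} prefix from that class and should be verified against the linked revision, and the interval values are arbitrary picks for a small test cluster:
{code}
# Assumed key names; check them against ContainerScannerConfiguration before use.
OZONE-SITE.XML_hdds.container.scrub.enabled=true
# Hypothetical short intervals so a small test cluster is scanned quickly.
OZONE-SITE.XML_hdds.container.scrub.metadata.scan.interval=1m
OZONE-SITE.XML_hdds.container.scrub.data.scan.interval=1m
{code}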
{quote}I could not find anything saying how to run scanner on demand. Is that 
even possible?
{quote}
"On demand" container scanning refers to scanning a container when the datanode 
has an error reading or writing to it. If you are looking for a way to run the 
scanner from the command line, we don't have that yet. It is planned in 
HDDS-8056, although the current proposal is a read-only option for debugging so 
that it does not interfere with a potentially running datanode process. The 
configs mentioned in the class above can be set to have the scanner run at 
short intervals, and given a small amount of data in the cluster it should get 
through it reasonably quickly.

> Provide a way for admin to identify all unhealthy container replicas
> --------------------------------------------------------------------
>
>                 Key: HDDS-7098
>                 URL: https://issues.apache.org/jira/browse/HDDS-7098
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>         Attachments: MissingContainers.png, image-2023-03-02-16-01-07-814.png
>
>
> Currently UNHEALTHY is a state that a container replica can be in 
> (ContainerReplicaProto#State), but not a state that the container can be in 
> overall (LifeCycleState). This means {{ozone admin container list}} has no 
> info about unhealthy containers, because it currently does not print replica 
> information. [Recon's 
> API|https://ozone.apache.org/docs/current/interface/reconapi.html] and UI 
> do not expose replica information either. The only way to determine 
> unhealthy containers is to run {{ozone admin container info <ID>}} for a 
> container that is already suspected to have unhealthy replicas. This jira 
> aims to provide a way to identify and filter container replica states, 
> through either Recon's UI, Recon's REST API, or client CLI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

