[
https://issues.apache.org/jira/browse/HDDS-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699869#comment-17699869
]
Ethan Rose commented on HDDS-7098:
----------------------------------
Thanks for checking this out [~mladjangadzic]. You should be able to use the
[docker compose definition from the upgrade acceptance
tests|https://github.com/apache/ozone/blob/e84aa4c4ea7e3d094630bb285afd2f4b38232426/hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/docker-compose.yaml]
to persist information through restarts.
{quote}The unhealthy replica is the one with lastBcsId=0. It looks like the Recon API
provides a way to point out unhealthy containers and their replicas.
{quote}
I don't see anything in the JSON response that indicates which replica(s) are
unhealthy.
{quote}.container file could not be parsed because checksum was different than
one expected
{quote}
This is expected behavior: since the container was not loaded, the replica was
never marked unhealthy. To get around this for testing purposes, you would need
to set {{hdds.container.checksum.verification.enabled=false}}.
Other configs that might help speed up testing are:
{code:java}
OZONE-SITE.XML_hdds.scm.replication.thread.interval=5s
OZONE-SITE.XML_hdds.scm.wait.time.after.safemode.exit=10s
{code}
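A sketch of how the checksum setting could sit alongside these, assuming the same {{OZONE-SITE.XML_}} prefix convention applies to it in the compose config file (for testing only):
{code}
# Testing only: disables checksum verification of the .container file.
OZONE-SITE.XML_hdds.container.checksum.verification.enabled=false
# Shorter SCM intervals so replication decisions show up sooner.
OZONE-SITE.XML_hdds.scm.replication.thread.interval=5s
OZONE-SITE.XML_hdds.scm.wait.time.after.safemode.exit=10s
{code}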
{quote}container was tried to be replicated but without success with an
exception:
{quote}
Even though the container was not loaded due to the checksum mismatch, the
files are still on the disk. SCM thinks there is a replica missing since the
container was not loaded, and tries to replicate it to the only datanode not
reporting the replica. However, the datanode still has the replica's files in
the way, so replication fails. The system would be able to work around this if
there were more nodes in the cluster.
{quote}Scanner scanned necessary files but no repair took place
{quote}
This is the volume scanner, which just checks that the disks for each
hdds.datanode.dir are still present. The container scanner, which would actually
check the contents of each container, is still WIP and off by default, but it
can at least identify corruption and move containers to the unhealthy state.
You would need to adjust the configs listed in [this
class|https://github.com/apache/ozone/blob/e84aa4c4ea7e3d094630bb285afd2f4b38232426/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScannerConfiguration.java]
for it to run.
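As a rough sketch of what that might look like in the compose config file: the key names below are assumptions written from memory of that class, so check them against the linked source before relying on them. The interval values are arbitrary short values for testing.
{code}
# Assumed key names; verify against ContainerScannerConfiguration.
OZONE-SITE.XML_hdds.container.scrub.enabled=true
# Arbitrary short intervals so a small test cluster is scanned quickly.
OZONE-SITE.XML_hdds.container.scrub.metadata.scan.interval=1m
OZONE-SITE.XML_hdds.container.scrub.data.scan.interval=1m
{code}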
{quote}I could not find anything saying how to run scanner on demand. Is that
even possible?
{quote}
"on demand" container scanning refers to scanning a container when the datanode
has an error reading or writing to it. If you are looking for a way to run the
scanner from the command line, we don't have that yet. It is planned in
HDDS-8056, although the current proposal is a read-only option for debugging so
that it does not interfere with a potentially running datanode process. The
configs mentioned in the class above can be set to have the scanner run at short
intervals, and given a small amount of data in the cluster it should get through
it reasonably quickly.
> Provide a way for admin to identify all unhealthy container replicas
> --------------------------------------------------------------------
>
> Key: HDDS-7098
> URL: https://issues.apache.org/jira/browse/HDDS-7098
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ethan Rose
> Assignee: Devesh Kumar Singh
> Priority: Major
> Attachments: MissingContainers.png, image-2023-03-02-16-01-07-814.png
>
>
> Currently UNHEALTHY is a state that a container replica can be in
> (ContainerReplicaProto#State), but not a state that the container can be in
> overall (LifeCycleState). This means {{ozone admin container list}} has no
> info about unhealthy containers, because it currently does not print replica
> information. [Recon's
> API|https://ozone.apache.org/docs/current/interface/reconapi.html] and UI
> do not expose replica information either. The only way to determine
> unhealthy containers is to run {{ozone admin container info <ID>}} for a
> container that is already suspected to have unhealthy replicas. This jira
> aims to provide a way to identify and filter container replica states,
> through either Recon's UI, Recon's REST API, or client CLI.