[
https://issues.apache.org/jira/browse/HDDS-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635606#comment-17635606
]
Siyao Meng commented on HDDS-7327:
----------------------------------
In one of the discussions with [~zitadombi], we figured it would be useful to
log UNHEALTHY container replicas reported by datanodes in a new SQL table
{{UNHEALTHY_REPLICAS}}, so that they can be shown in the Recon UI later.
We would potentially log those UNHEALTHY replicas in
{{ReconContainerReportHandler#onMessage}}, or under
{{checkAndAddNewContainerBatch}}.
A container replica is uniquely identified by the combination of:
1. {{containerID}}
2. datanode UUID ({{DatanodeDetails}})
as can be seen in
[{{ContainerReplicaInfo}}|https://github.com/apache/ozone/blob/722c444aea306b41cdbf5acbf13ecb8d51178746/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java#L28-L32].
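A minimal sketch of how that identity could be modelled as a composite key for the proposed table. The class name {{ReplicaKey}} and its accessors are hypothetical and not part of Ozone; the only thing taken from the discussion above is that the pair (containerID, datanode UUID) identifies a replica.

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical composite key for one row of the proposed UNHEALTHY_REPLICAS
// table. Not an existing Ozone class.
final class ReplicaKey {
  private final long containerId;
  private final UUID datanodeUuid;

  ReplicaKey(long containerId, UUID datanodeUuid) {
    this.containerId = containerId;
    this.datanodeUuid = Objects.requireNonNull(datanodeUuid, "datanodeUuid");
  }

  long getContainerId() {
    return containerId;
  }

  UUID getDatanodeUuid() {
    return datanodeUuid;
  }

  // Two keys are equal only if both the container ID and the datanode UUID match.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ReplicaKey)) {
      return false;
    }
    ReplicaKey other = (ReplicaKey) o;
    return containerId == other.containerId
        && datanodeUuid.equals(other.datanodeUuid);
  }

  @Override
  public int hashCode() {
    return Objects.hash(containerId, datanodeUuid);
  }
}
```

With a key like this, the same container reported by two different datanodes yields two distinct entries, while a repeated report from the same datanode maps to the same entry.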
Conversely, once those replicas are removed from SCM or recovered (that is, the
datanodes report them as healthy again, for example after some manual repair
on those datanodes), we would remove them from the {{UNHEALTHY_REPLICAS}}
table.
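The add/remove lifecycle described above could look roughly like the sketch below. It uses an in-memory map as a stand-in for the SQL table, and everything in it ({{UnhealthyReplicaTracker}}, its method names, passing the state as a string) is an assumption for illustration rather than existing Recon code. It also assumes that any reported state other than UNHEALTHY counts as recovered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical in-memory stand-in for the proposed UNHEALTHY_REPLICAS table.
final class UnhealthyReplicaTracker {
  // containerID -> UUIDs of datanodes currently reporting an UNHEALTHY replica.
  private final Map<Long, Set<UUID>> unhealthy = new HashMap<>();

  // Called for each replica in a datanode's container report.
  // Assumption: `state` is the replica state name, e.g. "UNHEALTHY" or "CLOSED".
  void onReplicaReport(long containerId, UUID datanodeUuid, String state) {
    if ("UNHEALTHY".equals(state)) {
      unhealthy.computeIfAbsent(containerId, k -> new HashSet<>())
          .add(datanodeUuid);
    } else {
      // Replica is reported in a non-UNHEALTHY state again: drop the entry.
      remove(containerId, datanodeUuid);
    }
  }

  // Called when the replica is removed from SCM.
  void onReplicaRemoved(long containerId, UUID datanodeUuid) {
    remove(containerId, datanodeUuid);
  }

  boolean isUnhealthy(long containerId, UUID datanodeUuid) {
    Set<UUID> datanodes = unhealthy.get(containerId);
    return datanodes != null && datanodes.contains(datanodeUuid);
  }

  // Total number of tracked (containerID, datanode UUID) entries.
  int size() {
    int count = 0;
    for (Set<UUID> datanodes : unhealthy.values()) {
      count += datanodes.size();
    }
    return count;
  }

  private void remove(long containerId, UUID datanodeUuid) {
    Set<UUID> datanodes = unhealthy.get(containerId);
    if (datanodes == null) {
      return;
    }
    datanodes.remove(datanodeUuid);
    if (datanodes.isEmpty()) {
      unhealthy.remove(containerId);
    }
  }
}
```

Healthy replicas never create an entry here, so a report that contains only healthy replicas would translate to no writes against the table.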
> Recon to note down replica states
> ---------------------------------
>
> Key: HDDS-7327
> URL: https://issues.apache.org/jira/browse/HDDS-7327
> Project: Apache Ozone
> Issue Type: Task
> Components: Ozone Recon
> Reporter: Siyao Meng
> Assignee: Zita Dombi
> Priority: Major
>
> Related previous discussion: HDDS-7098
> Right now it seems that Recon only takes note of the overall container health
> state in the Recon SQL DB:
> {code:bash}
> ij version 10.14
> ij> connect 'jdbc:derby:ozone_recon_derby.db';
> ij> show tables;
> TABLE_SCHEM |TABLE_NAME |REMARKS
> ------------------------------------------------------------------------
> ...
> SYSIBM |SYSDUMMY1 |
> RECON |CLUSTER_GROWTH_DAILY |
> RECON |FILE_COUNT_BY_SIZE |
> RECON |GLOBAL_STATS |
> RECON |RECON_TASK_STATUS |
> RECON |UNHEALTHY_CONTAINERS |
> 28 rows selected
> ij> select * from RECON.UNHEALTHY_CONTAINERS;
> container_id |container_state |in_state_since |expected_r&|actual_rep&|replica_de&|reason
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 1 |UNDER_REPLICATED|1665692819704 |3 |2 |1 |NULL
> {code}
> but Recon does not record the [health state of individual
> replicas|https://github.com/apache/ozone/blob/1e546103f0650dadc29cc5b6c931c0040e2d1d9c/hadoop-hdds/interface-server/src/main/proto/ScmServerDatanodeHeartbeatProtocol.proto#L209-L220]
> in the container. Recording it would let users check replica states in
> Recon.
> We might want to persist the info to Recon SQL DB only when datanodes report
> that a replica is unhealthy. Healthy replicas should not be persisted, to
> avoid too many writes, which can lead to performance issues.