[jira] [Commented] (HDDS-11341) Add dashboard for HDDS health and replication progress

Ethan Rose (Jira) Tue, 03 Sep 2024 07:22:04 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878898#comment-17878898
 ]


Ethan Rose commented on HDDS-11341:
-----------------------------------

I'm not sure which metrics are already available and which need to be added. 
Metrics will come from SCM and datanodes. Here's a list of things it would be 
good to track with metrics:
 * Number of datanodes in each 
[NodeState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L156]
 (Used to track alive and dead nodes)
 * Number of datanodes in each 
[NodeOperationalState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L163]
 (Used to track decommissioning nodes)
 * Number of containers in each state, similar to {{ozone admin container 
report}}
 * Number of healthy and failed datanode storage volumes.
 * Number of containers pending replication on decommissioning nodes.
 * Incoming and outgoing replication traffic in bytes for each datanode.
 * Number of replication and reconstruction commands queued up on each 
datanode.
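
To make the first few items concrete, here is a rough sketch of the kind of per-state aggregation a dashboard panel would show. It is only an illustration: the {{DatanodeSample}} record, the {{summarize}} helper, and the output keys are hypothetical names, not existing Ozone metrics or APIs, and the state strings are sample values rather than a complete list of the enum members.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatanodeSample:
    """Hypothetical snapshot of one datanode (not an Ozone class)."""
    hostname: str
    node_state: str          # e.g. a NodeState value
    operational_state: str   # e.g. a NodeOperationalState value
    healthy_volumes: int
    failed_volumes: int


def summarize(samples):
    """Aggregate per-datanode samples into cluster-level counts."""
    by_node_state = Counter(s.node_state for s in samples)
    by_op_state = Counter(s.operational_state for s in samples)
    return {
        "datanodes_by_node_state": dict(by_node_state),
        "datanodes_by_operational_state": dict(by_op_state),
        "healthy_volumes": sum(s.healthy_volumes for s in samples),
        "failed_volumes": sum(s.failed_volumes for s in samples),
    }


# Made-up sample data, for illustration only.
SAMPLES = [
    DatanodeSample("dn1", "HEALTHY", "IN_SERVICE", 4, 0),
    DatanodeSample("dn2", "HEALTHY", "DECOMMISSIONING", 3, 1),
    DatanodeSample("dn3", "DEAD", "IN_SERVICE", 0, 4),
]

if __name__ == "__main__":
    for key, value in summarize(SAMPLES).items():
        print(key, value)
```

In practice these counts would be published by SCM and the datanodes as metrics and aggregated in Grafana instead of in a script; the sketch only shows the shape of the numbers.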

> Add dashboard for HDDS health and replication progress
> ------------------------------------------------------
>
>                 Key: HDDS-11341
>                 URL: https://issues.apache.org/jira/browse/HDDS-11341
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Dashboards
>            Reporter: Ethan Rose
>            Priority: Major
>
> Add a Grafana dashboard to show information about datanode health, ongoing 
> and pending replication and reconstruction tasks, and the amount of data 
> being moved between nodes due to these tasks. This board will be useful to 
> monitor during disk failure, node failure, node decom, and maintenance.
> SCM replication manager likely has a lot of the metrics for ongoing tasks 
> already. We may need to add more metrics to datanodes to monitor tasks that 
> are ongoing (not just those that are queued) and the amount of data being 
> moved. I think some datanode command queue and handler related metrics are 
> unused as well and those can be checked/removed/updated as part of this PR.


