[jira] [Comment Edited] (HDDS-11341) Add dashboard for HDDS health and replication progress

Ethan Rose (Jira) Tue, 03 Sep 2024 13:49:04 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878898#comment-17878898
 ]


Ethan Rose edited comment on HDDS-11341 at 9/3/24 8:47 PM:
-----------------------------------------------------------

I'm not sure what metrics are available currently and what needs to be added. 
Metrics will come from SCM and datanodes. Here's a list of things it would be 
good to track with metrics:
 * Number of datanodes in each 
[NodeState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L156]
 (Used to track alive and dead nodes)
 * Number of datanodes in each 
[NodeOperationalState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L163]
 (Used to track decommissioning nodes)
 * Number of containers in each state, similar to 
{{ozone admin container report}}
 * Number of healthy and failed datanode storage volumes.
 * Number of containers pending replication on decommissioning nodes.
 * Incoming and outgoing replication traffic in bytes for each datanode.
 * Number of replication and reconstruction commands queued up on each 
datanode.
 * Storage utilization of each datanode (might already be in another dashboard 
as well)
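
As a rough illustration of the first two items, per-state datanode counts 
could be derived along the lines below. This is only a sketch under stated 
assumptions: the {{Datanode}} record and {{count_by_state}} helper are 
hypothetical and not part of Ozone, and the state names are illustrative, 
with the linked hdds.proto being the authoritative list.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative state names only; the linked hdds.proto is authoritative.
NODE_STATES = ("HEALTHY", "STALE", "DEAD")
OPERATIONAL_STATES = ("IN_SERVICE", "DECOMMISSIONING", "DECOMMISSIONED")


@dataclass(frozen=True)
class Datanode:
    """Hypothetical record for one datanode; not an Ozone class."""
    hostname: str
    node_state: str
    operational_state: str


def count_by_state(nodes, attribute, known_states):
    """Return {state: count}, with an explicit zero for states that have no nodes."""
    counts = Counter(getattr(node, attribute) for node in nodes)
    return {state: counts.get(state, 0) for state in known_states}


# Made-up sample data, purely for demonstration.
nodes = [
    Datanode("dn1", "HEALTHY", "IN_SERVICE"),
    Datanode("dn2", "HEALTHY", "DECOMMISSIONING"),
    Datanode("dn3", "DEAD", "IN_SERVICE"),
]

node_state_gauges = count_by_state(nodes, "node_state", NODE_STATES)
operational_gauges = count_by_state(nodes, "operational_state", OPERATIONAL_STATES)
```

Reporting a zero for states with no nodes, rather than omitting them, keeps 
each series continuous, so a dashboard panel would show a flat line at zero 
instead of a gap.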

We could add container balancer metrics to this dashboard as well, but that 
might make it too cluttered. It might be better to leave the balancer in a 
separate dashboard, since it is run on demand, whereas the items identified 
here happen in response to failures.


was (Author: erose):
I'm not sure what metrics are available currently and what needs to be added. 
Metrics will come from SCM and datanodes. Here's a list of things it would be 
good to track with metrics:
 * Number of datanodes in each 
[NodeState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L156]
 (Used to track alive and dead nodes)
 * Number of datanodes in each 
[NodeOperationalState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L163]
 (Used to track decommissioning nodes)
 * Number of containers in each state, similar to {{ozone admin container 
report}}
 * Number of healthy and failed datanode storage volumes.
 * Number of containers pending replication on decommissioning nodes.
 * Incoming and outgoing replication traffic in bytes for each datanode.
 * Number of replication and reconstruction commands queued up on each 
datanode.

> Add dashboard for HDDS health and replication progress
> ------------------------------------------------------
>
>                 Key: HDDS-11341
>                 URL: https://issues.apache.org/jira/browse/HDDS-11341
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Dashboards
>            Reporter: Ethan Rose
>            Priority: Major
>
> Add a Grafana dashboard to show information about datanode health, ongoing 
> and pending replication and reconstruction tasks, and the amount of data 
> being moved between nodes due to these tasks. This board will be useful to 
> monitor during disk failure, node failure, node decom, and maintenance.
> SCM replication manager likely has a lot of the metrics for ongoing tasks 
> already. We may need to add more metrics to datanodes to monitor tasks that 
> are ongoing (not just those that are queued) and the amount of data being 
> moved. I think some datanode command queue and handler related metrics are 
> unused as well and those can be checked/removed/updated as part of this PR.



