[
https://issues.apache.org/jira/browse/HDDS-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878898#comment-17878898
]
Ethan Rose commented on HDDS-11341:
-----------------------------------
I'm not sure which metrics are currently available and which need to be added.
Metrics will come from SCM and datanodes. Here's a list of things it would be
good to track with metrics:
* Number of datanodes in each
[NodeState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L156]
(Used to track alive and dead nodes)
* Number of datanodes in each
[NodeOperationalState|https://github.com/apache/ozone/blob/98369a8343f12479af99db4c83696945d40c1e3d/hadoop-hdds/interface-client/src/main/proto/hdds.proto#L163]
(Used to track decommissioning nodes)
* Number of containers in each state, similar to {{ozone admin container
report}}
* Number of healthy and failed datanode storage volumes.
* Number of containers pending replication on decommissioning nodes.
* Incoming and outgoing replication traffic in bytes for each datanode.
* Number of replication and reconstruction commands queued up on each
datanode.
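
As a rough sketch of the first two items, the per-state counts could be computed along these lines. Everything here is a hypothetical stand-in for illustration: the {{Health}} and {{OpState}} enums only mirror a few of the values in the linked proto definitions, and {{Node}} and {{DatanodeStateCounts}} are not existing Ozone classes. How the counts would actually be exposed as metrics is left open.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Ozone classes.
class DatanodeStateCounts {

  // Stand-in for a subset of NodeState values (assumption).
  enum Health { HEALTHY, STALE, DEAD }

  // Stand-in for a subset of NodeOperationalState values (assumption).
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED }

  // Minimal view of a datanode: an ID plus its two states.
  static final class Node {
    final String id;
    final Health health;
    final OpState opState;

    Node(String id, Health health, OpState opState) {
      this.id = id;
      this.health = health;
      this.opState = opState;
    }
  }

  // Counts nodes per health state; states with no nodes report 0 so a
  // dashboard panel always has a series for every state.
  static Map<Health, Integer> countByHealth(List<Node> nodes) {
    Map<Health, Integer> counts = new EnumMap<>(Health.class);
    for (Health h : Health.values()) {
      counts.put(h, 0);
    }
    for (Node n : nodes) {
      counts.merge(n.health, 1, Integer::sum);
    }
    return counts;
  }

  // Same aggregation keyed by operational state.
  static Map<OpState, Integer> countByOpState(List<Node> nodes) {
    Map<OpState, Integer> counts = new EnumMap<>(OpState.class);
    for (OpState s : OpState.values()) {
      counts.put(s, 0);
    }
    for (Node n : nodes) {
      counts.merge(n.opState, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Node> nodes = List.of(
        new Node("dn1", Health.HEALTHY, OpState.IN_SERVICE),
        new Node("dn2", Health.HEALTHY, OpState.DECOMMISSIONING),
        new Node("dn3", Health.DEAD, OpState.IN_SERVICE));
    System.out.println(countByHealth(nodes));
    System.out.println(countByOpState(nodes));
  }
}
```

Reporting zero for empty states is a deliberate choice in this sketch, so that a gauge for, say, dead nodes exists before the first node dies rather than appearing only after a failure.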
> Add dashboard for HDDS health and replication progress
> ------------------------------------------------------
>
> Key: HDDS-11341
> URL: https://issues.apache.org/jira/browse/HDDS-11341
> Project: Apache Ozone
> Issue Type: Improvement
> Components: Ozone Dashboards
> Reporter: Ethan Rose
> Priority: Major
>
> Add a Grafana dashboard to show information about datanode health, ongoing
> and pending replication and reconstruction tasks, and the amount of data
> being moved between nodes due to these tasks. This board will be useful to
> monitor during disk failure, node failure, node decommissioning, and maintenance.
> SCM replication manager likely has a lot of the metrics for ongoing tasks
> already. We may need to add more metrics to datanodes to monitor tasks that
> are ongoing (not just those that are queued) and the amount of data being
> moved. I think some datanode command queue and handler related metrics are
> unused as well and those can be checked/removed/updated as part of this PR.