[ https://issues.apache.org/jira/browse/HDDS-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886489#comment-17886489 ]
Ethan Rose commented on HDDS-11481:
-----------------------------------
Ok I see the issue. Persisting node membership in SCM adds some complexity.
Currently SCM persists items needed for correctness like container metadata and
pending deletions, and transient items like node membership are kept in memory.
The exception to this is Ratis pipelines, and their persistence does add
unfortunate complexity to SCM datanode interactions. HDFS has grown very
complicated over the years and IMO we should try to avoid adding similar
complexity to Ozone if we can get the same results with simpler solutions.
For the problem at hand, I can think of two options:
If you would like to do this yourself through a file, you can keep the list of
all datanodes in a file, and use {{ozone admin datanode list}} to poll the
current set of registered datanodes and diff from that file.
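As a rough sketch of that file-based approach (not an existing Ozone tool): the snippet below assumes you keep the expected datanodes in a plain text file, one hostname per line, and that you have already extracted the registered hostnames from the output of {{ozone admin datanode list}} into a second file of the same shape. The file layout and the function names {{load_hosts}} and {{diff_membership}} are hypothetical.

```python
from pathlib import Path


def load_hosts(path):
    """Read one hostname per line, skipping blank lines and '#' comments.

    The one-hostname-per-line layout is an assumption of this sketch.
    """
    hosts = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.add(line)
    return hosts


def diff_membership(expected, registered):
    """Compare the expected datanode set with the registered one.

    Returns (missing, unexpected):
      missing    - expected datanodes that have not registered
      unexpected - registered datanodes that are not in the expected file
    """
    missing = sorted(expected - registered)
    unexpected = sorted(registered - expected)
    return missing, unexpected
```

Calling {{diff_membership(load_hosts("expected.txt"), load_hosts("registered.txt"))}} from a cron job would give the two lists to alert on; the file names here are placeholders.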
The other option would be to use metrics, and maybe create a Grafana dashboard
for cluster membership. The issue with the current setup is that registration
is a push mechanism from datanodes via heartbeats. So if datanodes have not
registered, you can't see them at all from hubs like Recon and SCM. What we
need here is a pull mechanism that pulls the status from every datanode
directly. Metrics are collected this way. Using Prometheus, and optionally
Grafana, you should be able to pull the registration status of each datanode as
a metric to see which ones are registered or not. This could become part of a
larger "cluster membership" Grafana dashboard which reports statuses like
health state, operational state, registration, pipeline membership counts etc.
If the metrics we need for this do not currently exist, it would be good to add
them. If you have any interest in creating a
[dashboard|https://github.com/apache/ozone/tree/cce2f969a85323441c476aaeaf27d45b081b0c2f/hadoop-ozone/dist/src/main/compose/common/grafana/dashboards]
that would be a great contribution as well.
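
To illustrate the pull-based idea, here is a minimal sketch that asks Prometheus for a per-datanode registration gauge and lists the datanodes reporting 0. The metric name {{datanode_registered_with_scm}} is hypothetical (as noted above, such a metric may not exist yet), and the sketch assumes it would be exported as 1 when registered and 0 otherwise. Only the Prometheus instant-query endpoint {{/api/v1/query}} and its JSON response shape are taken as given.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical metric name: assumed to be 1 if the datanode is registered
# with SCM and 0 if it is not. It may need to be added to Ozone first.
METRIC = "datanode_registered_with_scm"


def build_query_url(prometheus_base, metric=METRIC):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": metric})
    return prometheus_base.rstrip("/") + "/api/v1/query?" + query


def unregistered_instances(response):
    """Return the 'instance' labels whose sample value is 0.

    `response` is the decoded JSON body of an instant query.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % (response,))
    found = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) == 0.0:
            found.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(found)


def fetch(prometheus_base, metric=METRIC):
    """Run the instant query against a live Prometheus server."""
    url = build_query_url(prometheus_base, metric)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Because Prometheus scrapes every configured datanode directly, a datanode that never registered would still show up here, which is the point of using a pull mechanism. The same query could back a Grafana panel instead of a script.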
> Enhanced SCM Support for DataNode Management
> --------------------------------------------
>
> Key: HDDS-11481
> URL: https://issues.apache.org/jira/browse/HDDS-11481
> Project: Apache Ozone
> Issue Type: Wish
> Components: SCM
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Attachments: screenshot-1.png
>
>
> I plan to enhance SCM's support for DataNode management, including features
> like blacklist and whitelist.
> Compared to the DataNode management functionality in HDFS, SCM's DataNode
> management still has some incomplete features:
> 1. For instance, the blacklist and whitelist functionality is missing.
> Currently, all DataNodes can register with SCM once they are started, but for
> the sake of completeness, we should implement a blacklist feature.
> 2. The display list function for DataNodes in SCM is not user-friendly, with
> the following issues:
> - The list does not support global sorting.
> - It cannot display the decommissioning progress. Once the decommissioning
> process begins, we can only passively refresh the page or rely on metrics to
> make judgments.
> - Key information about DataNodes is missing from the list, such as the
> number of containers and the number of pipelines.
> 3. In HDFS, if multiple DataNode versions are detected in the cluster, there
> are helpful prompts, but SCM's recognition and support for multiple DataNode
> versions are insufficient.