[ 
https://issues.apache.org/jira/browse/HDDS-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886489#comment-17886489
 ] 

Ethan Rose commented on HDDS-11481:
-----------------------------------

Ok I see the issue. Persisting node membership in SCM adds some complexity. 
Currently SCM persists items needed for correctness like container metadata and 
pending deletions, and transient items like node membership are kept in memory. 
The exception to this is Ratis pipelines, and their persistence does add 
unfortunate complexity to SCM datanode interactions. HDFS has grown very 
complicated over the years and IMO we should try to avoid adding similar 
complexity to Ozone if we can get the same results with simpler solutions.

For the problem at hand, I can think of two options:

If you would like to do this yourself through a file, you can keep the list of 
all datanodes in a file, and use {{ozone admin datanode list}} to poll the 
current set of registered datanodes and diff from that file.
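
As a rough sketch of the file-based option above: the script below reads the expected hostnames from a file and reports any that do not show up in the output of {{ozone admin datanode list}}. The file name {{expected_datanodes.txt}}, its one-hostname-per-line layout, and the helper names are all made up for illustration. It also assumes each registered datanode's hostname appears verbatim somewhere in the command output, so it matches hostnames as whole tokens rather than parsing a specific output format.

```python
import re
import subprocess
from pathlib import Path


def load_expected(path):
    """Read one hostname per line, skipping blank lines and '#' comments."""
    hosts = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.add(line)
    return hosts


def missing_datanodes(expected, cli_output):
    """Return the expected hostnames that never appear in the CLI output.

    A hostname only counts as present when it is not embedded in a longer
    name, so 'dn1' does not match 'dn10' or 'dn1.example.com'.
    """
    missing = []
    for host in sorted(expected):
        pattern = r"(?<![\w.-])" + re.escape(host) + r"(?![\w.-])"
        if not re.search(pattern, cli_output):
            missing.append(host)
    return missing


def main(expected_file="expected_datanodes.txt"):
    # Assumption: the 'ozone' CLI is on PATH and can reach SCM.
    result = subprocess.run(
        ["ozone", "admin", "datanode", "list"],
        capture_output=True, text=True, check=True,
    )
    expected = load_expected(expected_file)
    for host in missing_datanodes(expected, result.stdout):
        print(f"not registered: {host}")
```

Calling {{main()}} from cron or a similar scheduler would give a periodic report. The hostnames in the file need to be written the same way SCM reports them (short name vs. FQDN), otherwise every node will look missing.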

The other option would be to use metrics, and maybe create a Grafana dashboard 
for cluster membership. The issue with the current setup is that registration 
is a push mechanism from datanodes via heartbeats. So if datanodes have not 
registered, you can't see them at all from hubs like Recon and SCM. What we 
need here is a pull mechanism that pulls the status from every datanode 
directly. Metrics are collected this way. Using Prometheus and optionally 
Grafana, you should be able to pull the registration status of each datanode as 
a metric to see which ones are registered or not. This could become part of a 
larger "cluster membership" Grafana dashboard which reports statuses like 
health state, operational state, registration, pipeline membership counts, etc. 
If the metrics we need for this do not currently exist, it would be good to add 
them. If you have any interest in creating a 
[dashboard|https://github.com/apache/ozone/tree/cce2f969a85323441c476aaeaf27d45b081b0c2f/hadoop-ozone/dist/src/main/compose/common/grafana/dashboards]
 that would be a great contribution as well.
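
To make the metrics option a bit more concrete, here is a minimal sketch of checking registration through the Prometheus HTTP API ({{/api/v1/query}}). The metric name {{datanode_registered}} is hypothetical, since the metric may not exist yet, and the sketch assumes it would be a gauge that is 1 when the datanode is registered with SCM and 0 otherwise. The function names and the Prometheus base URL are placeholders too.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical metric name; a registration metric may still need to be added.
REGISTRATION_QUERY = "datanode_registered"


def unregistered_instances(response):
    """Return the 'instance' labels whose sample value is not 1.

    'response' is the decoded JSON body of a Prometheus instant query,
    where each result carries a 'metric' label map and a
    [timestamp, "value"] pair.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]
        if float(value) != 1.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


def query_prometheus(base_url, query=REGISTRATION_QUERY):
    """Run an instant query against Prometheus and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Something like {{unregistered_instances(query_prometheus("http://prometheus.example:9090"))}} would then list the datanodes that are up and being scraped but have not registered. Because Prometheus pulls from every configured target, this works even for nodes that SCM and Recon have never seen.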

> Enhanced SCM Support for DataNode Management
> --------------------------------------------
>
>                 Key: HDDS-11481
>                 URL: https://issues.apache.org/jira/browse/HDDS-11481
>             Project: Apache Ozone
>          Issue Type: Wish
>          Components: SCM
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> I plan to enhance SCM's support for DataNode management, including features 
> like blacklist and whitelist.
> Compared to the DataNode management functionality in HDFS, SCM's DataNode 
> management still has some incomplete features:
> 1. For instance, the blacklist and whitelist functionality is missing. 
> Currently, all DataNodes can register with SCM once they are started, but for 
> the sake of completeness, we should implement a blacklist feature.
> 2. The display list function for DataNodes in SCM is not user-friendly, with 
> the following issues: 
> - The list does not support global sorting. 
> - It cannot display the decommissioning progress. Once the decommissioning 
> process begins, we can only passively refresh the page or rely on metrics to 
> make judgments. 
> - Key information about DataNodes is missing from the list, such as the 
> number of containers and the number of pipelines.
> 3. In HDFS, if multiple DataNode versions are detected in the cluster, there 
> are helpful prompts, but SCM's recognition and support for multiple DataNode 
> versions are insufficient.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
