[
https://issues.apache.org/jira/browse/HDFS-16521?focusedWorklogId=747939&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-747939
]
ASF GitHub Bot logged work on HDFS-16521:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 25/Mar/22 19:29
Start Date: 25/Mar/22 19:29
Worklog Time Spent: 10m
Work Description: virajjasani commented on pull request #4107:
URL: https://github.com/apache/hadoop/pull/4107#issuecomment-1079369473
@ayushtkn While I agree that the JMX metric for slow nodes is already
available, not every downstream application can access it directly. For
instance, in Kubernetes-managed clusters an HDFS client cannot reach JMX
metrics unless port forwarding is enabled, which is not common in production.
We have a similar case with the `DFS.getDataNodeStats()` API: it provides
live/decommissioned/dead node info that is also exposed through JMX metrics,
yet downstream or deployment-management applications prefer the DFS API over
JMX for the same concerns mentioned above.
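For context, a minimal sketch (not from this PR) of how a downstream or
deployment-management application consumes the existing
`DFS.getDataNodeStats()` API today; it assumes the loaded configuration's
`fs.defaultFS` points at the target HDFS cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

public class DatanodeStatsExample {
  public static void main(String[] args) throws Exception {
    // Assumes core-site.xml/hdfs-site.xml on the classpath point at the cluster.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Existing DFS API: live datanode report without going through JMX.
      DatanodeInfo[] live = dfs.getDataNodeStats(DatanodeReportType.LIVE);
      for (DatanodeInfo dn : live) {
        System.out.println(dn.getHostName() + " dfsUsed=" + dn.getDfsUsed());
      }
    }
  }
}
```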
Moreover, it is not only downstream applications that would use the API; we
should also provide a `dfsadmin -report` option that reports slow node info
for operators, and only an API can back that option.
The JMX metric only exposes the slowNode and reportingNodes info for each
unique slow peer detection; it does not expose other important data such as
how many blocks are currently available or what the DFS usage is, and it does
not need to. Providing as much concrete info as possible about each slow node,
however, would be the API's responsibility.
With an API, we also don't need to keep tuning
`dfs.datanode.max.nodes.to.report` to adjust how many of the top N slow nodes
get exposed (a limit that certainly makes sense for the JMX metric).
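To make the proposal a bit more concrete, here is a rough sketch of the
client-facing shape being suggested; the method name `getSlowDatanodeStats()`
and the exact return type are placeholders for illustration, not the final
API:

```java
import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

/**
 * Illustrative sketch only: a possible client-side surface for the proposed
 * slow-datanode API. The method name is a hypothetical placeholder.
 */
public interface SlowDatanodeReporting {
  /**
   * Datanodes currently flagged by the namenode's slow peer tracker, returned
   * with full DatanodeInfo details (address, usage, admin state) rather than
   * only the slowNode/reportingNodes pairs carried by the JMX report.
   */
  DatanodeInfo[] getSlowDatanodeStats() throws IOException;
}
```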
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 747939)
Time Spent: 20m (was: 10m)
> DFS API to retrieve slow datanodes
> ----------------------------------
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In order to build automation around datanodes that regularly show up in the
> slow peer tracking report (e.g. decommission such nodes, queue them up for
> external processing, and add them back to the cluster after the issues are
> fixed), we should expose a DFS API to retrieve all slow nodes at a given
> time.
> Providing such an API would also help add an additional option to "dfsadmin
> -report" that lists slow datanode info for operators to review, a
> particularly useful filter for larger clusters.
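A hedged sketch of the automation described above, assuming a client-side
`getSlowDatanodeStats()`-style call exists (which is exactly what this issue
proposes); the method name is hypothetical and the collected hostnames are
simply printed for external tooling to act on:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class SlowNodeCollector {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Hypothetical call; substitute whatever method the final API exposes.
      for (DatanodeInfo dn : dfs.getSlowDatanodeStats()) {
        // External automation (decommission queue, ticketing, re-add after
        // repair) consumes these hostnames; decommissioning itself still goes
        // through the usual exclude-hosts / refreshNodes workflow.
        System.out.println(dn.getHostName());
      }
    }
  }
}
```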
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]