[
https://issues.apache.org/jira/browse/HDFS-16521?focusedWorklogId=763266&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763266
]
ASF GitHub Bot logged work on HDFS-16521:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 28/Apr/22 02:51
Start Date: 28/Apr/22 02:51
Worklog Time Spent: 10m
Work Description: virajjasani commented on code in PR #4107:
URL: https://github.com/apache/hadoop/pull/4107#discussion_r860405718
##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodePeerMetrics.java:
##########
@@ -142,14 +144,28 @@ public void collectThreadLocalStates() {
* than their peers.
*/
public Map<String, Double> getOutliers() {
- // This maps the metric name to the aggregate latency.
- // The metric name is the datanode ID.
- final Map<String, Double> stats =
- sendPacketDownstreamRollingAverages.getStats(
- minOutlierDetectionSamples);
- LOG.trace("DataNodePeerMetrics: Got stats: {}", stats);
-
- return slowNodeDetector.getOutliers(stats);
+ // outlier must be null for source code.
+ if (testOutlier == null) {
+ // This maps the metric name to the aggregate latency.
+ // The metric name is the datanode ID.
+ final Map<String, Double> stats =
+ sendPacketDownstreamRollingAverages.getStats(minOutlierDetectionSamples);
+ LOG.trace("DataNodePeerMetrics: Got stats: {}", stats);
+ return slowNodeDetector.getOutliers(stats);
+ } else {
+ // this happens only for test code.
+ return testOutlier;
+ }
+ }
+
+ /**
+ * Strictly to be used by test code only. Source code is not supposed to use this. This method
+ * directly sets outlier mapping so that aggregate latency metrics are not calculated for tests.
+ *
+ * @param outlier outlier directly set by tests.
+ */
+ public void setTestOutliers(Map<String, Double> outlier) {
Review Comment:
Yeah, it's very difficult to reproduce an actual slow node in a UT, hence I had
to do it this way. Sure, added a comment on the `testOutlier` member as well (in
addition to this setter method's Javadoc).
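For readers skimming the thread, here is a minimal, self-contained sketch of the test-only override pattern discussed above. `PeerMetricsSketch` and its `Supplier`-based detector are hypothetical stand-ins invented for illustration; they are not the Hadoop `DataNodePeerMetrics` class, and only the `testOutlier` / `setTestOutliers` / `getOutliers` names are taken from the diff.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for the metrics class; NOT the Hadoop implementation. */
public class PeerMetricsSketch {
  // Stays null in production code paths; set only by tests.
  private volatile Map<String, Double> testOutlier = null;

  // Assumption: the real outlier computation is abstracted behind a supplier.
  private final Supplier<Map<String, Double>> detector;

  public PeerMetricsSketch(Supplier<Map<String, Double>> detector) {
    this.detector = detector;
  }

  /** Returns the injected map if a test set one, else the computed outliers. */
  public Map<String, Double> getOutliers() {
    Map<String, Double> injected = testOutlier;
    return injected != null ? injected : detector.get();
  }

  /** Strictly for tests: bypasses the latency aggregation entirely. */
  public void setTestOutliers(Map<String, Double> outlier) {
    this.testOutlier = outlier;
  }

  public static void main(String[] args) {
    PeerMetricsSketch metrics = new PeerMetricsSketch(Collections::emptyMap);
    metrics.setTestOutliers(Collections.singletonMap("dn2:9866", 500.0));
    System.out.println(metrics.getOutliers());
  }
}
```

The trade-off is the one raised in review: a test hook lives in production code, so it needs a clear comment on both the field and the setter.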
##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java:
##########
@@ -632,6 +638,20 @@ private static void printDataNodeReports(DistributedFileSystem dfs,
}
}
+ private static void printSlowDataNodeReports(DistributedFileSystem dfs, boolean listNodes,
Review Comment:
> One comment on the slow datanode report is that it seems to say nothing
> about why the NN thinks it slow;
It's the datanodes that determine whether their peer datanodes are slower; the NN
just aggregates all DN reports.
> For example, say something about how in excess a DNs latency is? (Perhaps
> this could be added later)
Sure, this can be added as additional info. Will create a follow-up Jira.
Thanks @saintstack
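To make the "DNs report, NN aggregates" point concrete, here is a rough sketch. Everything in it is an assumption for illustration: the class name, the map shapes, and in particular `excessOverMedian`, which is only one possible way a follow-up could express how far in excess a DN's latency is. It does not reflect the actual NameNode code or the detector's real thresholds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the NameNode's slow peer tracking code. */
public class SlowPeerAggregationSketch {

  /**
   * Inverts reporter -> (slow peer -> latency) into
   * slow peer -> (reporter -> latency), which is what an aggregated report needs.
   */
  public static Map<String, Map<String, Double>> aggregate(
      Map<String, Map<String, Double>> reportsByReporter) {
    Map<String, Map<String, Double>> bySlowPeer = new TreeMap<>();
    for (Map.Entry<String, Map<String, Double>> report : reportsByReporter.entrySet()) {
      for (Map.Entry<String, Double> outlier : report.getValue().entrySet()) {
        bySlowPeer
            .computeIfAbsent(outlier.getKey(), k -> new TreeMap<>())
            .put(report.getKey(), outlier.getValue());
      }
    }
    return bySlowPeer;
  }

  /**
   * Assumed "excess" measure: ratio of one latency to the median of all
   * observed latencies. Requires a non-empty list with a positive median.
   */
  public static double excessOverMedian(double latency, List<Double> allLatencies) {
    List<Double> sorted = new ArrayList<>(allLatencies);
    Collections.sort(sorted);
    int n = sorted.size();
    double median = (n % 2 == 1)
        ? sorted.get(n / 2)
        : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    return latency / median;
  }
}
```

With this shape, a report line for a slow node could list which peers flagged it and, if the follow-up lands, a ratio such as the one above.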
##########
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md:
##########
@@ -394,7 +394,7 @@ Usage:
| COMMAND\_OPTION | Description |
|:--
Issue Time Tracking
-------------------
Worklog Id: (was: 763266)
Time Spent: 4.5h (was: 4h 20m)
> DFS API to retrieve slow datanodes
> ----------------------------------
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> Providing a DFS API to retrieve slow nodes would make it possible to add an
> option to "dfsadmin -report" that lists slow datanode info for operators to
> review, a filter that is especially useful for larger clusters.
> The other purpose of such an API is to let HDFS downstreamers without direct
> access to the namenode http port (only the rpc port accessible) retrieve slow nodes.
> Moreover,
> [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
> in HBase currently has to rely on its own way of marking and excluding slow
> nodes while 1) creating pipelines and 2) handling acks, based on factors like
> the data length of the packet, processing time with last ack timestamp,
> whether flush to replicas is finished, etc. If it can utilize a slownode API
> from HDFS to exclude nodes appropriately while writing a block, a lot of its
> own post-ack computation of slow nodes can be _saved_ or _improved_, or, based
> on further experiments, we could find a _better solution_ to manage slow node
> detection logic in both HDFS and HBase. However, in order to collect more
> data points and run more POCs in this area, HDFS should provide an API for
> downstreamers to efficiently utilize slownode info for such critical
> low-latency use cases (like writing WALs).