[ https://issues.apache.org/jira/browse/HDFS-16521?focusedWorklogId=762716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762716 ]
ASF GitHub Bot logged work on HDFS-16521:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Apr/22 05:32
            Start Date: 27/Apr/22 05:32
    Worklog Time Spent: 10m
      Work Description: jojochuang commented on code in PR #4107:
URL: https://github.com/apache/hadoop/pull/4107#discussion_r859381403


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -4914,6 +4914,33 @@ int getNumberOfDatanodes(DatanodeReportType type) {
     }
   }

+  DatanodeInfo[] slowDataNodesReport() throws IOException {
+    String operationName = "slowDataNodesReport";
+    DatanodeInfo[] datanodeInfos;
+    checkSuperuserPrivilege(operationName);

Review Comment:
   Does it need to require superuser privilege?


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java:
##########
@@ -433,7 +433,7 @@ static int run(DistributedFileSystem dfs, String[] argv, int idx) throws IOExcep
    */
   private static final String commonUsageSummary =
     "\t[-report [-live] [-dead] [-decommissioning] " +
-      "[-enteringmaintenance] [-inmaintenance]]\n" +
+      "[-enteringmaintenance] [-inmaintenance] [-slownodes]]\n" +

Review Comment:
   The corresponding documentation needs to be updated when CLI commands are added or changed.


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java:
##########
@@ -632,6 +638,20 @@ private static void printDataNodeReports(DistributedFileSystem dfs,
     }
   }

+  private static void printSlowDataNodeReports(DistributedFileSystem dfs, boolean listNodes,

Review Comment:
   Can you provide a sample output? I suspect it would be confusing as is; you would probably need some kind of header to distinguish it from the other DataNode reports.


##########
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java:
##########
@@ -1868,4 +1868,16 @@ BatchedEntries<OpenFileEntry> listOpenFiles(long prevId,
    */
   @AtMostOnce
   void satisfyStoragePolicy(String path) throws IOException;
+
+  /**
+   * Get report on all of the slow Datanodes. Slow running datanodes are identified based on
+   * the Outlier detection algorithm, if slow peer tracking is enabled for the DFS cluster.
+   *
+   * @return Datanode report for slow running datanodes.
+   * @throws IOException If an I/O error occurs.
+   */
+  @Idempotent
+  @ReadOnly
+  DatanodeInfo[] getSlowDatanodeReport() throws IOException;

Review Comment:
   I just want to check with everyone that it is okay to have an array of objects as the return value. I think it's fine, but I want to confirm, because once we decide on the interface it can't be changed later.


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java:
##########
@@ -433,7 +433,7 @@ static int run(DistributedFileSystem dfs, String[] argv, int idx) throws IOExcep
    */
   private static final String commonUsageSummary =
     "\t[-report [-live] [-dead] [-decommissioning] " +
-      "[-enteringmaintenance] [-inmaintenance]]\n" +
+      "[-enteringmaintenance] [-inmaintenance] [-slownodes]]\n" +

Review Comment:
   In fact, this would be confusing to HDFS administrators. These subcommands are meant to filter the DNs by state, and "slownodes" is not a defined DataNode state.

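To make the reviewer's request for a sample output with a distinguishing header concrete, here is a minimal sketch of one possible layout. Everything in it is an assumption for illustration: the class `SlowNodeReportFormat`, the method `format`, the header wording, and the host:port strings are hypothetical and are not taken from the PR or from `DFSAdmin`.

```java
import java.util.Arrays;
import java.util.List;

public class SlowNodeReportFormat {

  /**
   * Builds a report section for slow datanodes. The header line (hypothetical
   * wording) separates this section from the other datanode reports.
   */
  public static String format(List<String> slowNodes) {
    StringBuilder sb = new StringBuilder();
    // Header with a count, so an empty section is still unambiguous.
    sb.append("Slow datanodes (").append(slowNodes.size()).append("):\n");
    for (String node : slowNodes) {
      // One indented entry per slow node.
      sb.append("  ").append(node).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Hypothetical node names, used only to show the layout.
    System.out.print(format(Arrays.asList("dn1.example.com:9866", "dn7.example.com:9866")));
  }
}
```

A separate, clearly labeled section also sidesteps the concern that "slownodes" would be read as another DataNode state filter.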
Issue Time Tracking
-------------------

    Worklog Id:     (was: 762716)
    Time Spent: 3.5h  (was: 3h 20m)

> DFS API to retrieve slow datanodes
> ----------------------------------
>
>                 Key: HDFS-16521
>                 URL: https://issues.apache.org/jira/browse/HDFS-16521
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Providing a DFS API to retrieve slow nodes would make it possible to add an
> option to "dfsadmin -report" that lists slow datanode info for operators to
> review, a filter that is especially useful for larger clusters.
> The other purpose of such an API is to let HDFS downstreamers without direct
> access to the namenode http port (only the rpc port accessible) retrieve slow nodes.
> Moreover,
> [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
> in HBase currently has to rely on its own way of marking and excluding slow
> nodes while 1) creating pipelines and 2) handling acks, based on factors like
> the data length of the packet, processing time relative to the last ack timestamp,
> whether flush to replicas is finished, etc. If it can use a slownode API
> from HDFS to exclude nodes appropriately while writing a block, a lot of its
> own post-ack computation of slow nodes can be _saved_ or _improved_, or, based
> on further experiments, we could find a _better solution_ for managing slow node
> detection logic in both HDFS and HBase. However, in order to collect more
> data points and run more POCs in this area, HDFS should provide an API for
> downstreamers to efficiently use slownode info for such critical
> low-latency use cases (like writing WALs).
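
The downstream use case described above (excluding slow nodes while building a write pipeline) can be sketched as a simple filter. This is a minimal, self-contained illustration under stated assumptions: the class `SlowNodeExclusion`, the method `excludeSlow`, and the host:port strings are hypothetical stand-ins, not HDFS or HBase code. A real client would obtain the slow set from the slow datanode report proposed in this issue and would work with `DatanodeInfo` objects rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SlowNodeExclusion {

  /**
   * Returns the candidates that are not in the slow set, preserving the
   * original candidate order. Plain strings stand in for datanode identities.
   */
  public static List<String> excludeSlow(List<String> candidates, Set<String> slowNodes) {
    List<String> healthy = new ArrayList<>();
    for (String node : candidates) {
      if (!slowNodes.contains(node)) {
        healthy.add(node);
      }
    }
    return healthy;
  }

  public static void main(String[] args) {
    // Hypothetical pipeline candidates and a hypothetical slow-node set.
    List<String> candidates = Arrays.asList("dn1:9866", "dn2:9866", "dn3:9866");
    Set<String> slow = new HashSet<>(Arrays.asList("dn2:9866"));
    System.out.println(excludeSlow(candidates, slow));
  }
}
```

Keeping the slow set as a `Set` makes each membership check constant time on average, which matters if the filter runs on a latency-sensitive path such as WAL writes.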