[ https://issues.apache.org/jira/browse/HDFS-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-11194:
---------------------------------
    Attachment: HDFS-11194.01.patch

Attached a patch.

This builds upon the downstream peer latencies collected by DataNodes during the write pipeline (HDFS-10917). Over time, the DataNodes will have sufficient samples to determine which peers are slow relative to the rest. This logic should be conservative with some high/low thresholds as safeguards so the set of outliers is tiny compared to all peers. These peers can be reported to the NameNode occasionally, allowing it to detect the top N slow nodes ranked by the number of peers that found them slow.

The attached patch looks large but most of it is plumbing. The interesting changes are in two classes:
# SlowNodeDetector (on the DataNode) – Find high outliers given aggregate peer latencies.
# SlowPeerTracker (on the NameNode) – Accumulate reports from DataNodes and expose the top N (currently 5) slow nodes via NameNode JMX, an idea borrowed from HDFS-6982.

The idea of collecting peer statistics to find slow nodes also came up at the HDFS BoF at a Hadoop Summit (proposed by Allen W., I think). The statistical analysis has ideas from [~szetszwo]. Thank you for the comments [~apurtell] and [~drankye].

All of the above is off by default. Assuming 3% of the nodes in the cluster are flagged as outliers by each node (any higher and we need to further tone down the outlier detection), then in a 3000 node cluster the expected NN state is 3000 * (3000 * 3%) * 25 bytes/report ~ 7 MB.
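
To make the NameNode-side ranking and the memory estimate above concrete, here is a minimal sketch in Python (the patch itself is Java). It is not the code from HDFS-11194.01.patch: the class name `SlowPeerTrackerSketch`, its methods, and the `estimated_state_bytes` helper are hypothetical, and the sketch assumes each report simply names a slow node and the node that reported it. Only the ranking rule (top N by number of distinct reporters, N = 5) and the 25 bytes/report figure come from the comment.

```python
from collections import defaultdict


class SlowPeerTrackerSketch:
    """Hypothetical stand-in for a NameNode-side tracker (not the patch's API)."""

    def __init__(self, top_n=5):
        self.top_n = top_n
        # Maps a slow node to the set of distinct nodes that reported it.
        self._reporters = defaultdict(set)

    def add_report(self, slow_node, reporting_node):
        """Record that reporting_node found slow_node to be slow."""
        self._reporters[slow_node].add(reporting_node)

    def top_slow_nodes(self):
        """Return up to top_n (node, reporter_count) pairs, most reported first.

        Ties are broken by node name so the result is deterministic.
        """
        ranked = sorted(
            self._reporters.items(),
            key=lambda item: (-len(item[1]), item[0]),
        )
        return [(node, len(reporters)) for node, reporters in ranked[: self.top_n]]


def estimated_state_bytes(num_nodes, outlier_percent, bytes_per_report=25):
    """Reproduce the estimate: nodes * (nodes * percent) * bytes per report."""
    reports_per_node = num_nodes * outlier_percent // 100
    return num_nodes * reports_per_node * bytes_per_report


if __name__ == "__main__":
    tracker = SlowPeerTrackerSketch()
    tracker.add_report("dn3", "dn1")
    tracker.add_report("dn3", "dn2")
    tracker.add_report("dn4", "dn1")
    print(tracker.top_slow_nodes())
    print(estimated_state_bytes(3000, 3))
```

Keeping a set of reporters per node means a DataNode that reports the same peer repeatedly is counted once, which matches ranking "by the number of peers that found them slow". How the actual patch expires old reports or serializes the result for JMX is not shown here.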

> Maintain aggregated peer performance metrics on NameNode
> --------------------------------------------------------
>
>                 Key: HDFS-11194
>                 URL: https://issues.apache.org/jira/browse/HDFS-11194
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.8.0
>            Reporter: Xiaobing Zhou
>            Assignee: Arpit Agarwal
>         Attachments: HDFS-11194.01.patch
>
>
> The metrics collected in HDFS-10917 should be reported to and aggregated on
> NameNode as part of heartbeat messages. This will make it easy to expose them
> through JMX to users who are interested in them.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)