[ 
https://issues.apache.org/jira/browse/HDFS-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated HDFS-11194:
---------------------------------
    Attachment: HDFS-11194.01.patch

Attached a patch. This builds upon the downstream peer latencies collected by 
DataNodes during the write pipeline (HDFS-10917).

Over time, the DataNodes will have sufficient samples to determine which peers 
are slow relative to the rest. This logic should be conservative with some 
high/low thresholds as safeguards so the set of outliers is tiny compared to 
all peers. These peers can be reported to the NameNode occasionally, allowing 
it to detect the top N slow nodes ranked by the number of peers that found them 
slow.
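
To make the DataNode-side idea concrete, here is a minimal sketch of conservative outlier detection over per-peer average latencies. It is not taken from the attached patch: the class name, the constants, and the choice of median plus median absolute deviation (MAD) are all assumptions made for illustration. The minimum-peer check and the latency floor stand in for the "high/low thresholds as safeguards" mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names and constants are illustrative, not from the patch.
final class SlowPeerOutlierSketch {
    // Assumed safeguard: do not judge anyone with too few peers sampled.
    static final int MIN_PEERS = 10;
    // Assumed safeguard: never flag a peer whose latency is below this floor.
    static final double LOW_THRESHOLD_MS = 5.0;
    // Assumed multiplier: how many MADs above the median counts as slow.
    static final double DEVIATION_MULTIPLIER = 3.0;

    // Median of a list of values; the input list is not modified.
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
            ? sorted.get(n / 2)
            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Returns the peers whose average latency is a high outlier,
    // keyed by peer name and sorted for stable output.
    static Map<String, Double> findSlowPeers(Map<String, Double> avgLatencyMs) {
        Map<String, Double> outliers = new TreeMap<>();
        if (avgLatencyMs.size() < MIN_PEERS) {
            return outliers; // not enough samples to be confident
        }
        List<Double> values = new ArrayList<>(avgLatencyMs.values());
        double med = median(values);
        List<Double> deviations = new ArrayList<>();
        for (double v : values) {
            deviations.add(Math.abs(v - med));
        }
        double mad = median(deviations);
        // A peer is slow only if it clears both the statistical bound and the floor.
        double upperBound = Math.max(LOW_THRESHOLD_MS, med + DEVIATION_MULTIPLIER * mad);
        for (Map.Entry<String, Double> e : avgLatencyMs.entrySet()) {
            if (e.getValue() > upperBound) {
                outliers.put(e.getKey(), e.getValue());
            }
        }
        return outliers;
    }
}
```

With these assumed settings, a node that sees eleven peers between 1 ms and 2 ms and one peer at 50 ms would flag only the 50 ms peer, and a node that has sampled fewer than ten peers would flag nobody.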

The attached patch looks large but most of it is plumbing. The interesting 
changes are in two classes:
# SlowNodeDetector (on the DataNode) – Find high outliers given aggregate peer 
latencies.
# SlowPeerTracker (on the NameNode) – Accumulate reports from DataNodes and 
expose the top N (currently 5) slow nodes via NameNode JMX, an idea borrowed 
from HDFS-6982.
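
The NameNode-side accumulation can be sketched roughly as follows. This is an illustration only, not the SlowPeerTracker from the patch: the class and method names are hypothetical, and report expiry and the JMX wiring are left out. It shows just the ranking rule described above, where a node's score is the number of distinct peers that reported it as slow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the SlowPeerTracker class from the patch.
final class SlowPeerTrackerSketch {
    // Slow node -> set of distinct nodes that reported it as slow.
    private final Map<String, Set<String>> reportersBySlowNode = new HashMap<>();

    // Record that reportingNode found slowNode to be slow.
    // Repeated reports from the same reporter are counted once.
    synchronized void addReport(String slowNode, String reportingNode) {
        reportersBySlowNode
            .computeIfAbsent(slowNode, k -> new HashSet<>())
            .add(reportingNode);
    }

    // Return up to n slow nodes, ranked by number of distinct reporters
    // (descending), with ties broken by node name for determinism.
    synchronized List<String> topN(int n) {
        List<Map.Entry<String, Set<String>>> entries =
            new ArrayList<>(reportersBySlowNode.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue().size(), a.getValue().size());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Keeping a set of reporters per slow node, rather than a plain counter, is one way to avoid a single noisy DataNode inflating a peer's rank by reporting it repeatedly.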

The idea of collecting peer statistics to find slow nodes also came up at the 
HDFS BoF at a Hadoop Summit (proposed by Allen W., I think). The statistical 
analysis has ideas from [~szetszwo].

Thank you for the comments [~apurtell] and [~drankye]. All of the above is off 
by default. Assuming each node flags 3% of the nodes in the cluster as outliers 
(any higher and we need to tone down the outlier detection further), the 
expected NN state in a 3000-node cluster is 3000 * (3000 * 3%) * 25 
bytes/report = 6.75 MB, or roughly 7 MB.
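
The estimate above can be reproduced with a few lines of arithmetic. The helper below is a hypothetical illustration, not part of the patch; its name and parameters are assumptions, and it simply multiplies the number of reporting nodes by the number of outliers each one flags and by the per-report size.

```java
// Hypothetical helper; reproduces the back-of-the-envelope estimate only.
final class SlowPeerStateEstimate {
    // nodes: number of DataNodes in the cluster
    // outlierFraction: fraction of the cluster each node flags as slow (e.g. 0.03)
    // bytesPerReport: assumed size of one stored report on the NameNode
    static long estimateBytes(int nodes, double outlierFraction, int bytesPerReport) {
        long reportsPerNode = Math.round(nodes * outlierFraction);
        return (long) nodes * reportsPerNode * bytesPerReport;
    }
}
```

For the figures quoted above, `estimateBytes(3000, 0.03, 25)` gives 6,750,000 bytes. Under a fixed outlier fraction the state grows with the square of the node count, which is one reason to keep the detection conservative.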

> Maintain aggregated peer performance metrics on NameNode
> --------------------------------------------------------
>
>                 Key: HDFS-11194
>                 URL: https://issues.apache.org/jira/browse/HDFS-11194
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.8.0
>            Reporter: Xiaobing Zhou
>            Assignee: Arpit Agarwal
>         Attachments: HDFS-11194.01.patch
>
>
> The metrics collected in HDFS-10917 should be reported to and aggregated on 
> NameNode as part of heartbeat messages. This will make it easy to expose them 
> through JMX to interested users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

