[
https://issues.apache.org/jira/browse/HDFS-14783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
HaiBin Huang updated HDFS-14783:
--------------------------------
Description:
SlowPeersReport in the namenode's JMX tells us which datanodes are slow. It is
calculated from the average latency of packets sent from one datanode to
another. For example, if dn1 sending packets to dn2 takes too long on average
(over the *upperLimitLatency*), the namenode's JMX shows a SlowPeersReport like
this:
"SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
However, if dn1 sends a few packets to dn2 slowly at first and then sends no
packets to dn2 for a long time, the SlowPeersReport above stays in the
namenode's JMX. I think this SlowPeersReport may be an expired message: the
network between dn1 and dn2 may have returned to normal, but the report remains
in the namenode's JMX until the next time dn1 sends a packet to dn2. So I use a
timestamp to record when an *org.apache.hadoop.metrics2.util.SampleStat* is
created, and calculate the average duration using only the valid *SampleStat*
instances, where validity is judged by that timestamp.
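
A minimal sketch of the timestamp idea described above, in plain Java. This is
not the attached patch and does not use the real *SampleStat* class; the names
ExpiringLatencyWindow, Batch and validityMs are hypothetical and exist only for
illustration. Each batch of latency samples records when it was created, and
the average only counts batches that are still within the validity window, so
an old slow reading stops marking a peer as slow once it expires.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch only: not the code attached to HDFS-14783. */
final class ExpiringLatencyWindow {

    /** A timestamped batch of samples (a stand-in for a timestamped SampleStat). */
    private static final class Batch {
        final long createdMs;        // when this batch was recorded
        final long count;            // number of packets in the batch
        final double totalLatencyMs; // summed latency of those packets

        Batch(long createdMs, long count, double totalLatencyMs) {
            this.createdMs = createdMs;
            this.count = count;
            this.totalLatencyMs = totalLatencyMs;
        }
    }

    private final Deque<Batch> batches = new ArrayDeque<>();
    private final long validityMs; // assumed expiry interval for a batch

    ExpiringLatencyWindow(long validityMs) {
        this.validityMs = validityMs;
    }

    /** Records a batch of samples observed at time nowMs. */
    void add(long nowMs, long count, double totalLatencyMs) {
        batches.addLast(new Batch(nowMs, count, totalLatencyMs));
    }

    /** Average latency over non-expired batches, or NaN if none remain. */
    double average(long nowMs) {
        // Batches are appended in time order, so expired ones sit at the head.
        while (!batches.isEmpty()
                && nowMs - batches.peekFirst().createdMs > validityMs) {
            batches.removeFirst();
        }
        long n = 0;
        double sum = 0.0;
        for (Batch b : batches) {
            n += b.count;
            sum += b.totalLatencyMs;
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    /** True only if there are valid samples and their average exceeds the limit. */
    boolean isSlow(long nowMs, double upperLimitLatencyMs) {
        double avg = average(nowMs);
        return !Double.isNaN(avg) && avg > upperLimitLatencyMs;
    }
}
```

With a validity window of 1000 ms, a slow batch recorded at time 0 reports dn2
as slow at time 500, but no longer does at time 5000 if dn1 has sent nothing
to dn2 in between.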
> expired SlowPeersReport will keep staying on namenode's jmx
> -----------------------------------------------------------
>
> Key: HDFS-14783
> URL: https://issues.apache.org/jira/browse/HDFS-14783
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: HaiBin Huang
> Priority: Major
> Attachments: HDFS-14783
>