[
https://issues.apache.org/jira/browse/HDFS-14783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haibin Huang updated HDFS-14783:
--------------------------------
Description:
SlowPeersReport is calculated from the SampleStat between two datanodes, and it
appears in the namenode's JMX like this:
{code:java}
"SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
{code}
The SampleStat is stored in a LinkedBlockingDeque<SumAndCount> and is not
removed until the queue is full and a newer one is generated. Therefore, if
dn1 doesn't send any packets to dn2 for a long time, the old SampleStat stays
in the queue and is still used to calculate slow peers. I think these old
SampleStats should be treated as expired and ignored when generating a new
SlowPeersReport.
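
The retention behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the Hadoop implementation: the class BoundedWindowSketch, its methods, and the simplified SumAndCount stand-in are all hypothetical, and only LinkedBlockingDeque is a real JDK class.

```java
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical sketch; not the actual Hadoop rolling-average code.
class BoundedWindowSketch {
    // Simplified stand-in for the SumAndCount entries named in the issue.
    static final class SumAndCount {
        final double sum;
        final long count;

        SumAndCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    private final LinkedBlockingDeque<SumAndCount> window;

    BoundedWindowSketch(int capacity) {
        this.window = new LinkedBlockingDeque<>(capacity);
    }

    // An old entry leaves only when the deque is full and a new one arrives.
    void add(SumAndCount sample) {
        while (!window.offerLast(sample)) {
            window.pollFirst();
        }
    }

    // Average over everything still in the deque, however old it is.
    double average() {
        double sum = 0;
        long count = 0;
        for (SumAndCount s : window) {
            sum += s.sum;
            count += s.count;
        }
        return count == 0 ? 0.0 : sum / count;
    }

    int size() {
        return window.size();
    }

    public static void main(String[] args) {
        BoundedWindowSketch w = new BoundedWindowSketch(3);
        w.add(new SumAndCount(900.0, 3)); // one slow burst from dn1 to dn2
        // No further samples arrive, so the slow entry is never evicted and
        // keeps contributing to the average regardless of elapsed time.
        System.out.println(w.average());
    }
}
```

Because eviction is driven only by new arrivals, an idle dn1 to dn2 link keeps its last slow sample indefinitely in this sketch.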
was:
SlowPeersReport in the namenode's JMX tells us which datanode is slow, and
it is calculated from the average duration of packet sends between two
datanodes. For example, if dn1 sending packets to dn2 takes too long on
average (over the *upperLimitLatency*), you will see a SlowPeersReport in the
namenode's JMX like this:
{code:java}
"SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
{code}
However, if dn1 sends some packets to dn2 slowly at the beginning and then
doesn't send any packets to dn2 for a long time, the above SlowPeersReport
stays in the namenode's JMX. I think this SlowPeersReport might be an expired
message, because the network between dn1 and dn2 may have returned to normal,
but the SlowPeersReport remains in the namenode's JMX until the next time dn1
sends a packet to dn2. So I use a timestamp to record when an
*org.apache.hadoop.metrics2.util.SampleStat* is created, and calculate the
average duration from the valid *SampleStat* entries only, where validity is
judged by the timestamp.
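
The timestamp idea above could look roughly like the sketch below. It is an illustration under stated assumptions, not the attached patch: ExpiringAverageSketch, TimedSample, validityMs, and the threshold value used in main are hypothetical names and numbers. Only the notion of an upperLimitLatency threshold comes from the description.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.OptionalDouble;

// Hypothetical sketch; not the code in the attached patches.
class ExpiringAverageSketch {
    // A sample that remembers when it was created.
    static final class TimedSample {
        final double sum;
        final long count;
        final long createdMs;

        TimedSample(double sum, long count, long createdMs) {
            this.sum = sum;
            this.count = count;
            this.createdMs = createdMs;
        }
    }

    private final Deque<TimedSample> samples = new ArrayDeque<>();
    private final long validityMs; // assumed expiry window

    ExpiringAverageSketch(long validityMs) {
        this.validityMs = validityMs;
    }

    void add(double sum, long count, long nowMs) {
        samples.addLast(new TimedSample(sum, count, nowMs));
    }

    // Average over samples whose age is within the validity window.
    OptionalDouble validAverage(long nowMs) {
        double sum = 0;
        long count = 0;
        for (TimedSample s : samples) {
            if (nowMs - s.createdMs <= validityMs) {
                sum += s.sum;
                count += s.count;
            }
        }
        return count == 0 ? OptionalDouble.empty() : OptionalDouble.of(sum / count);
    }

    // A peer is reported slow only if a valid average exceeds the limit.
    boolean isSlow(long nowMs, double upperLimitLatency) {
        OptionalDouble avg = validAverage(nowMs);
        return avg.isPresent() && avg.getAsDouble() > upperLimitLatency;
    }

    public static void main(String[] args) {
        ExpiringAverageSketch stats = new ExpiringAverageSketch(60_000L);
        stats.add(900.0, 3, 0L); // slow burst at t = 0
        System.out.println(stats.isSlow(1_000L, 100.0));   // sample still valid
        System.out.println(stats.isSlow(120_000L, 100.0)); // sample expired
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would more likely read a monotonic clock.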
> expired SampleStat need to be removed from SlowPeersReport
> ----------------------------------------------------------
>
> Key: HDFS-14783
> URL: https://issues.apache.org/jira/browse/HDFS-14783
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haibin Huang
> Assignee: Haibin Huang
> Priority: Major
> Attachments: HDFS-14783, HDFS-14783-001.patch, HDFS-14783-002.patch
>
>
> SlowPeersReport is calculated from the SampleStat between two datanodes, and
> it appears in the namenode's JMX like this:
> {code:java}
> "SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
> {code}
> The SampleStat is stored in a LinkedBlockingDeque<SumAndCount> and is not
> removed until the queue is full and a newer one is generated. Therefore, if
> dn1 doesn't send any packets to dn2 for a long time, the old SampleStat
> stays in the queue and is still used to calculate slow peers. I think
> these old SampleStats should be treated as expired and ignored
> when generating a new SlowPeersReport.