[ 
https://issues.apache.org/jira/browse/HDFS-14783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibin Huang updated HDFS-14783:
--------------------------------
    Description: 
SlowPeersReport is calculated from the SampleStat between two dn, so it can 
appear in nn's jmx like this:
{code:java}
"SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
{code}
The SampleStat is stored in a LinkedBlockingDeque<SumAndCount>, and it won't be 
removed until the queue is full and a new one is generated. Therefore, if 
dn1 doesn't send any packets to dn2 for a long time, the old SampleStat will 
stay in the queue and will still be used to calculate the slow peer. I think 
these old SampleStats should be considered expired and ignored when 
generating a new SlowPeersReport.
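
To make the idea concrete, here is a minimal sketch of a bounded deque whose entries carry a creation timestamp and are skipped once they are older than an expiry interval. It is not the attached patch: the class ExpiringLatencyWindow, the TimedSumAndCount holder, and the expiryMs parameter are hypothetical names chosen for illustration, and only java.util classes are used.

```java
import java.util.OptionalDouble;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical illustration only; these names do not come from the HDFS code base.
final class ExpiringLatencyWindow {

  /** Sum and count for one sampling interval, plus the time it was created. */
  static final class TimedSumAndCount {
    final double sum;
    final long count;
    final long createdAtMs;

    TimedSumAndCount(double sum, long count, long createdAtMs) {
      this.sum = sum;
      this.count = count;
      this.createdAtMs = createdAtMs;
    }
  }

  private final LinkedBlockingDeque<TimedSumAndCount> window;
  private final long expiryMs; // assumed expiry interval, in milliseconds

  ExpiringLatencyWindow(int capacity, long expiryMs) {
    this.window = new LinkedBlockingDeque<>(capacity);
    this.expiryMs = expiryMs;
  }

  /** Adds a sample; when the deque is full the oldest entry is evicted first. */
  void add(double sum, long count, long nowMs) {
    TimedSumAndCount sample = new TimedSumAndCount(sum, count, nowMs);
    while (!window.offerLast(sample)) {
      window.pollFirst();
    }
  }

  /** Averages only the samples that are not older than expiryMs. */
  OptionalDouble average(long nowMs) {
    double totalSum = 0;
    long totalCount = 0;
    for (TimedSumAndCount s : window) {
      if (nowMs - s.createdAtMs > expiryMs) {
        continue; // expired: ignore instead of letting it skew the report
      }
      totalSum += s.sum;
      totalCount += s.count;
    }
    return totalCount == 0
        ? OptionalDouble.empty()
        : OptionalDouble.of(totalSum / totalCount);
  }
}
```

With this shape, a slow sample recorded long ago no longer contributes to the average, and a peer with no valid samples yields an empty result rather than a stale one.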

  was:
SlowPeersReport in the namenode's jmx tells us which datanode is a slow node, and 
it is calculated from the average duration of packet sends between two datanodes. 
For example, if sending packets from dn1 to dn2 takes too long on average (over 
the *upperLimitLatency*), you will see a SlowPeersReport in the namenode's jmx like 
this:
{code:java}
"SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
{code}
However, if dn1 sends some packets to dn2 slowly at the 
beginning and then doesn't send any packets to dn2 for a long time, the 
abovementioned SlowPeersReport stays in the namenode's jmx. I think this 
SlowPeersReport might be an expired message, because the network between dn1 
and dn2 may have returned to normal, but the SlowPeersReport remains in the 
namenode's jmx until the next time dn1 sends packets to dn2. So I use a timestamp 
to record when an *org.apache.hadoop.metrics2.util.SampleStat* is created, and 
calculate the average duration using only the valid *SampleStat* entries, where 
validity is judged by the timestamp.


> expired SampleStat need to be removed from SlowPeersReport
> ----------------------------------------------------------
>
>                 Key: HDFS-14783
>                 URL: https://issues.apache.org/jira/browse/HDFS-14783
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Haibin Huang
>            Assignee: Haibin Huang
>            Priority: Major
>         Attachments: HDFS-14783, HDFS-14783-001.patch, HDFS-14783-002.patch
>
>
> SlowPeersReport is calculated from the SampleStat between two dn, so it can 
> appear in nn's jmx like this:
> {code:java}
> "SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
> {code}
> The SampleStat is stored in a LinkedBlockingDeque<SumAndCount>, and it won't be 
> removed until the queue is full and a new one is generated. Therefore, if 
> dn1 doesn't send any packets to dn2 for a long time, the old SampleStat will 
> stay in the queue and will still be used to calculate the slow peer. I think 
> these old SampleStats should be considered expired and ignored 
> when generating a new SlowPeersReport.


