[
https://issues.apache.org/jira/browse/HADOOP-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257005#comment-17257005
]
Haibin Huang commented on HADOOP-16947:
---------------------------------------
Updated the patch to fix checkstyle. Hi [~elgoiri] [~ayushtkn] [~hexiaoqiao]
[~weichiu], could you take a look at this? Currently, if a node has a faulty
network card, the other reporting nodes report it to the namenode as a slow
peer. Even after we remove this slow node from the cluster, the
SlowPeersReport still lists it, because the reporting nodes keep refilling
their SumAndCount deque with the old data. This is confusing if we want to use
SlowPeersReport for real-time monitoring. So we need a timestamp that defines
when old data becomes invalid.
org.apache.hadoop.metrics2.lib.MutableRollingAverages#rollOverAvgs
{code:java}
final SumAndCount sumAndCount = new SumAndCount(
    rate.lastStat().total(),
    rate.lastStat().numSamples());
{code}
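To make the timestamp idea concrete, here is a minimal sketch of what I mean. It is not the patch itself: StaleAwareWindow, TimedSumAndCount, validityMs and freshAverage are hypothetical names chosen for illustration, and the sketch assumes each snapshot can carry the time at which the underlying stat was last updated. Entries older than the validity window are skipped when the average is computed.

```java
import java.util.concurrent.LinkedBlockingDeque;

/** Illustrative sketch only; all names here are hypothetical, not Hadoop APIs. */
class StaleAwareWindow {

  /** A (sum, count) snapshot tagged with the time its stat was last updated. */
  static final class TimedSumAndCount {
    final double sum;
    final long count;
    final long snapshotTimeMs;

    TimedSumAndCount(double sum, long count, long snapshotTimeMs) {
      this.sum = sum;
      this.count = count;
      this.snapshotTimeMs = snapshotTimeMs;
    }
  }

  private final LinkedBlockingDeque<TimedSumAndCount> window;
  private final long validityMs;

  StaleAwareWindow(int numWindows, long validityMs) {
    this.window = new LinkedBlockingDeque<>(numWindows);
    this.validityMs = validityMs;
  }

  /** Called once per period: append a snapshot, evicting the oldest if full. */
  void rollOver(double sum, long count, long snapshotTimeMs) {
    TimedSumAndCount entry = new TimedSumAndCount(sum, count, snapshotTimeMs);
    if (!window.offerLast(entry)) {
      window.pollFirst();
      window.offerLast(entry);
    }
  }

  /** Average over snapshots younger than validityMs; NaN if none is fresh. */
  double freshAverage(long nowMs) {
    double totalSum = 0;
    long totalCount = 0;
    for (TimedSumAndCount e : window) {
      if (nowMs - e.snapshotTimeMs < validityMs) {
        totalSum += e.sum;
        totalCount += e.count;
      }
    }
    return totalCount == 0 ? Double.NaN : totalSum / totalCount;
  }
}
```

With something like this, a reporting node that keeps re-snapshotting the same unchanged stat would still fill the deque, but those entries would stop contributing once their timestamp falls outside the validity window, so the removed node would drop out of the report.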
> Stale record should be remove when generating SlowPeersReport
> -------------------------------------------------------------
>
> Key: HADOOP-16947
> URL: https://issues.apache.org/jira/browse/HADOOP-16947
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Haibin Huang
> Assignee: Haibin Huang
> Priority: Major
> Attachments: HADOOP-16947-001.patch, HADOOP-16947-002.patch,
> HADOOP-16947-003.patch, HADOOP-16947-004.patch, HADOOP-16947-005.patch,
> HADOOP-16947-006.patch, HADOOP-16947-007.patch, HDFS-14783,
> HDFS-14783-001.patch, HDFS-14783-002.patch, HDFS-14783-003.patch,
> HDFS-14783-004.patch, HDFS-14783-005.patch
>
>
> SlowPeersReport is generated from the SampleStat between two DataNodes, so
> it can appear in the NameNode's JMX like this:
> {code:java}
> "SlowPeersReport" :[{"SlowNode":"dn2","ReportingNodes":["dn1"]}]
> {code}
> In each period, MutableRollingAverages calls rollOverAvgs(), which generates
> a SumAndCount object based on the SampleStat and stores it in a
> LinkedBlockingDeque<SumAndCount>; the deque is then used to generate the
> SlowPeersReport. Old members of the deque are not removed until the deque is
> full. However, if dn1 doesn't send any packets to dn2 during the last
> 36*300_000 ms, the deque will be filled with copies of the same old member,
> because the values of the last SampleStat never change. I think these old
> SampleStats should be treated as expired and ignored when generating a new
> SlowPeersReport.