[
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908887#comment-14908887
]
Hudson commented on HDFS-9107:
------------------------------
FAILURE: Integrated in Hadoop-trunk-Commit #8521 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/8521/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
> Key: HDFS-9107
> URL: https://issues.apache.org/jira/browse/HDFS-9107
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.0-alpha
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Priority: Critical
> Fix For: 2.8.0
>
> Attachments: HDFS-9107.patch, HDFS-9107.patch
>
>
> A full GC pause in the NameNode (NN) that exceeds the dead node interval can
> lead to an infinite cycle of full GCs. The most common situation that
> precipitates an unrecoverable state is a network issue that temporarily cuts
> off multiple racks.
>
> The NN wakes up and starts falsely marking nodes dead. This bloats the
> replication queues, which increases memory pressure. The replications create
> a flurry of incremental block reports and a glut of over-replicated blocks.
>
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration,
> which requires a full block report: more memory pressure. The NN now has to
> invalidate all the over-replicated blocks. The extra blocks are added to
> invalidation queues, tracked in an excess blocks map, etc.: much more memory
> pressure.
>
> All of this memory pressure can push the NN into another full GC, which
> repeats the entire cycle.
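
The cycle described above starts when the heartbeat check trusts its own
clock after a long stall. One way to break it, shown here only as a minimal
sketch and not as the code committed in HeartbeatManager.java, is to skip a
dead-node check whenever the checker itself has clearly been paused. The class
name, method names, and the "twice the recheck interval" threshold below are
all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of a pause-aware dead-node check.
 * This is NOT the HDFS-9107 patch; names and thresholds are assumptions.
 */
class PauseAwareHeartbeatMonitor {
    private final long deadNodeIntervalMs; // silence after which a node counts as dead
    private final long recheckIntervalMs;  // expected spacing between two checks
    private final Map<String, Long> lastHeartbeatMs = new LinkedHashMap<>();
    private long lastCheckMs;

    PauseAwareHeartbeatMonitor(long deadNodeIntervalMs, long recheckIntervalMs, long nowMs) {
        this.deadNodeIntervalMs = deadNodeIntervalMs;
        this.recheckIntervalMs = recheckIntervalMs;
        this.lastCheckMs = nowMs;
    }

    /** Records a heartbeat received from the given node at time nowMs. */
    void heartbeat(String node, long nowMs) {
        lastHeartbeatMs.put(node, nowMs);
    }

    /**
     * Returns the nodes to declare dead at time nowMs. Returns an empty list
     * when the monitor itself appears to have been stalled since the last check.
     */
    List<String> check(long nowMs) {
        long sinceLastCheck = nowMs - lastCheckMs;
        lastCheckMs = nowMs;

        // Assumed threshold: if far more time passed than one recheck interval,
        // the checker was probably paused (for example by a full GC). Heartbeats
        // could not be processed during the pause, so node silence proves nothing.
        // Skip this round and let the nodes heartbeat before judging them.
        if (sinceLastCheck > 2 * recheckIntervalMs) {
            return new ArrayList<>();
        }

        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > deadNodeIntervalMs) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

With a dead node interval of 10 seconds and a recheck interval of 1 second, a
check that runs 29 seconds late returns no dead nodes. A node that heartbeats
before the next on-time check stays alive, while a node that remains silent is
marked dead by that next check.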