[ https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-9107:
--------------------------------------
    Fix Version/s: 2.7.5
                   2.9.0

Pushed to branch-2.7. Only a minor conflict in imports. Updated Fix versions.

> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
>                 Key: HDFS-9107
>                 URL: https://issues.apache.org/jira/browse/HDFS-9107
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>             Fix For: 2.8.0, 2.9.0, 3.0.0-alpha1, 2.7.5
>
>         Attachments: HDFS-9107.patch, HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an
> infinite cycle of full GCs. The most common situation that precipitates an
> unrecoverable state is a network issue that temporarily cuts off multiple
> racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the
> replication queues which increases memory pressure. The replications create a
> flurry of incremental block reports and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration
> which requires a full block report - more memory pressure. The NN now has to
> invalidate all the over-replicated blocks. The extra blocks are added to
> invalidation queues, tracked in an excess blocks map, etc - much more memory
> pressure.
> All the memory pressure can push the NN into another full GC which repeats
> the entire cycle.
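
As a rough illustration of the cycle described above, the following Python sketch shows one way a heartbeat monitor could notice that it was itself stalled (for example by a full GC) and skip dead-node marking for that pass, so that nodes get a chance to heartbeat before being judged. It is not taken from the attached patch: the class name, its fields, the `max_gap` threshold, and the interval values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy model of a dead-node monitor (hypothetical, not HDFS code)."""
    dead_interval: float   # seconds of silence before a node counts as dead
    max_gap: float         # largest believable spacing between two passes
    last_check: float = 0.0
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this node.
        self.last_heartbeat[node] = now

    def check(self, now: float) -> List[str]:
        """Return the nodes to mark dead on this pass."""
        gap = now - self.last_check
        self.last_check = now
        if gap > self.max_gap:
            # The monitor itself was paused. Heartbeats sent during the
            # pause could not be processed, so the timestamps are not
            # trustworthy. Skip this pass instead of marking nodes dead.
            return []
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen > self.dead_interval
        )


# Illustrative timeline, in seconds.
monitor = HeartbeatMonitor(dead_interval=630, max_gap=600)
monitor.heartbeat("dn1", 0)
monitor.heartbeat("dn2", 0)

normal_pass = monitor.check(300)        # regular pass, nobody is stale yet
after_pause = monitor.check(1500)       # 1200 s since last pass: skipped
monitor.heartbeat("dn1", 1503)          # dn1 reports in right after the pause
next_pass = monitor.check(1800)         # only the node still silent is dead
```

Without the `gap` check, the pass at 1500 seconds would have declared both nodes dead, which is the false marking that starts the cycle in the description. With it, only `dn2`, which never reported back, is marked dead on the following pass.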