[ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicolas Fraison reopened HDFS-10220:
------------------------------------

Hi Ravi,

Thanks for the feedback, but I believe the issue we are facing is not the same as the one described in HDFS-4882. In fact, that patch has already been applied in the hadoop 2.6.0 package shipped with cdh 5.4.0 (sorry, but Cloudera backports patches from different versions of Hadoop, so I can't give you the exact Hadoop release), and from the source code I can confirm that it is also applied in cdh5.5.0:
https://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.0.releasenotes.html

Nicolas

> Namenode failover due to too long loking in LeaseManager.Monitor
> ----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Priority: Minor
>             Fix For: 2.6.1
>
>         Attachments: HADOOP-10220.001.patch, threaddump_zkfc.txt
>
>
> I have faced a namenode failover caused by an unresponsive namenode detected
> by the zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc, many threads are blocked waiting on a
> lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when some
> leases must be released. Because of the very large number of leases to be
> released, the namenode took too long to release them, blocking all other
> tasks and making the zkfc think that the namenode was not available/stuck.
> The idea of this patch is to limit the number of leases released each time
> we check for leases, so that the lock is not held for too long.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
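
For illustration only, the idea in the issue description (release a bounded number of leases per check, so the lock is given up between checks) could be sketched roughly as follows. This is not the attached patch and not Hadoop code: the class BoundedLeaseReleaser, the field maxReleasesPerCheck, and the use of a plain ReentrantLock in place of the namesystem lock are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a lease monitor; names and structure are assumed.
class BoundedLeaseReleaser {
    // Stands in for the global lock that other namenode operations also need.
    private final ReentrantLock lock = new ReentrantLock();
    // Holders of expired leases waiting to be released, oldest first.
    private final Deque<String> expiredLeases = new ArrayDeque<>();
    // Assumed tuning knob: upper bound on releases per check.
    private final int maxReleasesPerCheck;

    BoundedLeaseReleaser(int maxReleasesPerCheck) {
        if (maxReleasesPerCheck <= 0) {
            throw new IllegalArgumentException("maxReleasesPerCheck must be positive");
        }
        this.maxReleasesPerCheck = maxReleasesPerCheck;
    }

    void addExpiredLease(String holder) {
        lock.lock();
        try {
            expiredLeases.addLast(holder);
        } finally {
            lock.unlock();
        }
    }

    // Releases at most maxReleasesPerCheck leases, then gives up the lock so
    // that other threads can make progress. Returns the number released.
    int checkLeases() {
        int released = 0;
        lock.lock();
        try {
            while (released < maxReleasesPerCheck && !expiredLeases.isEmpty()) {
                expiredLeases.pollFirst(); // real code would close the file here
                released++;
            }
        } finally {
            lock.unlock();
        }
        return released;
    }

    // Number of expired leases still waiting for a later check.
    int pending() {
        lock.lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.unlock();
        }
    }
}
```

With a bound of 2 and five expired leases, successive calls to checkLeases() would release the leases over three passes instead of holding the lock for all five at once. Leases left over simply wait for the next check.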