[jira] [Reopened] (HDFS-10220) Namenode failover due to too long locking in LeaseManager.Monitor

Nicolas Fraison (JIRA) Tue, 29 Mar 2016 14:09:17 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nicolas Fraison reopened HDFS-10220:
------------------------------------

Hi Ravi,

Thanks for the feedback, but I believe the issue we face is not the same as the 
one described in HDFS-4882.
In fact, the HDFS-4882 patch is already applied in the hadoop 2.6.0 package in 
cdh 5.4.0 (sorry, but Cloudera backports patches from different versions of 
hadoop, so I can't give you the exact hadoop release), and from the source code 
I can confirm that it is also applied in cdh5.5.0: 
https://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.0.releasenotes.html.

Nicolas

> Namenode failover due to too long locking in LeaseManager.Monitor
> -----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Priority: Minor
>             Fix For: 2.6.1
>
>         Attachments: HADOOP-10220.001.patch, threaddump_zkfc.txt
>
>
> I faced a namenode failover after the zkfc detected an unresponsive 
> namenode, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All 
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc, many threads are blocked waiting on a 
> lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when 
> leases must be released. Because of the very large number of leases to be 
> released, the namenode took too long to release them, blocking all 
> other tasks and making the zkfc think that the namenode was 
> unavailable or stuck.
> The idea of this patch is to limit the number of leases released each time we 
> check for leases, so that the lock is not held for too long.
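
The idea described above can be illustrated with a minimal, self-contained sketch. This is not the attached patch and not the actual LeaseManager code: the class name BoundedLeaseReleaser, the maxReleasesPerCheck limit, and the use of a plain ReentrantLock in place of the namesystem lock are all assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical sketch: release at most a fixed number of expired leases
 * per lock acquisition, so the lock is never held for an unbounded time.
 */
class BoundedLeaseReleaser {
    // Stand-in for the lock that the lease monitor shares with other tasks.
    private final ReentrantLock lock = new ReentrantLock();
    // Stand-in for the set of leases that have expired and must be released.
    private final Queue<String> expiredLeases = new ArrayDeque<>();
    // Hypothetical limit on the work done in one check.
    private final int maxReleasesPerCheck;

    BoundedLeaseReleaser(int maxReleasesPerCheck) {
        if (maxReleasesPerCheck <= 0) {
            throw new IllegalArgumentException("maxReleasesPerCheck must be positive");
        }
        this.maxReleasesPerCheck = maxReleasesPerCheck;
    }

    void addExpiredLease(String holder) {
        lock.lock();
        try {
            expiredLeases.add(holder);
        } finally {
            lock.unlock();
        }
    }

    /** Releases up to maxReleasesPerCheck leases and returns how many were released. */
    int checkLeases() {
        lock.lock();
        try {
            int released = 0;
            while (released < maxReleasesPerCheck && !expiredLeases.isEmpty()) {
                expiredLeases.poll(); // stand-in for releasing a single lease
                released++;
            }
            return released;
        } finally {
            // Remaining leases wait for the next check; other threads can
            // take the lock in between.
            lock.unlock();
        }
    }

    int pending() {
        lock.lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.unlock();
        }
    }
}
```

With a limit of 2 and five expired leases, successive calls to checkLeases() release the leases in batches, and the lock is dropped between batches so that other threads waiting on it get a chance to run.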



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
