[ 
https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253354#comment-15253354
 ] 

Walter Su commented on HDFS-10220:
----------------------------------

You are right. My only remaining question is whether the default value of 
1000 is the right choice, or whether throttling the release rate is the right 
approach at all. I would like this to work out of the box: small companies 
with small clusters have cluster administrators who may not fully understand 
what the configuration means.

bq. Counting the time is better in terms of functionality, but I'm afraid of 
adding extra computation time to this check compared to a simple count of 
files. The idea is not to spend more time releasing those leases. What is your 
feeling about it?
I believe the overhead is negligible. Alternatively, we can compute the 
elapsed time only after processing each small batch.

I saw that {{BlockManager.BlockReportProcessingThread}} releases the writeLock 
if it has held it for more than 4ms. Do you think the same idea would work here?
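
To make the time-based idea concrete, here is a minimal sketch, not the actual LeaseManager or BlockManager code. The class name, the queue of lease holders, and the batch size are all hypothetical. It reads the clock only once per small batch, so the per-lease overhead stays close to that of a simple counter, and it stops as soon as the lock has been held longer than a configured budget (for example 4ms).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; names do not come from the HDFS code base.
class TimeBoundedLeaseReleaser {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Queue<String> expiredLeases = new ArrayDeque<>();
  private final long maxLockHoldNanos;
  private final int batchSize;

  TimeBoundedLeaseReleaser(long maxLockHoldMs, int batchSize) {
    this.maxLockHoldNanos = TimeUnit.MILLISECONDS.toNanos(maxLockHoldMs);
    this.batchSize = batchSize;
  }

  void addExpiredLease(String holder) {
    lock.writeLock().lock();
    try {
      expiredLeases.add(holder);
    } finally {
      lock.writeLock().unlock();
    }
  }

  int pending() {
    lock.readLock().lock();
    try {
      return expiredLeases.size();
    } finally {
      lock.readLock().unlock();
    }
  }

  /** One monitor pass. Returns how many leases were released. */
  int releaseOnce() {
    int released = 0;
    lock.writeLock().lock();
    try {
      long start = System.nanoTime();
      while (!expiredLeases.isEmpty()) {
        // Process a small batch without touching the clock.
        for (int i = 0; i < batchSize && !expiredLeases.isEmpty(); i++) {
          expiredLeases.poll(); // stand-in for the real release work
          released++;
        }
        // Check the elapsed time once per batch; stop when over budget.
        if (System.nanoTime() - start >= maxLockHoldNanos) {
          break;
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
    return released;
  }
}
```

Leases left in the queue are picked up on the next pass of the monitor, so other threads get a chance to take the lock in between.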

> Namenode failover due to too long locking in LeaseManager.Monitor
> -----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, 
> HADOOP-10220.003.patch, HADOOP-10220.004.patch, threaddump_zkfc.txt
>
>
> I experienced a namenode failover caused by an unresponsive namenode detected 
> by the zkfc, with a large number of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All 
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc, many threads are blocked on a lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when leases 
> must be released. Because of the very large number of leases to be released, 
> the namenode took too long to release them, blocking all other tasks and 
> making the zkfc think that the namenode was unavailable or stuck.
> The idea of this patch is to limit the number of leases released each time we 
> check for leases, so that the lock is not held for too long.
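
The count-based limit described in the issue can be sketched as follows. This is a minimal illustration under assumptions, not the attached patch: the class name, the queue of lease holders, and the `maxReleasesPerCheck` parameter are hypothetical, and 1000 is simply the default value mentioned in the discussion.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; names do not come from the HDFS code base.
class CountLimitedLeaseReleaser {
  private final ReentrantLock lock = new ReentrantLock();
  private final Queue<String> expiredLeases = new ArrayDeque<>();
  private final int maxReleasesPerCheck;

  CountLimitedLeaseReleaser(int maxReleasesPerCheck) {
    this.maxReleasesPerCheck = maxReleasesPerCheck;
  }

  void addExpiredLease(String holder) {
    lock.lock();
    try {
      expiredLeases.add(holder);
    } finally {
      lock.unlock();
    }
  }

  /** One check: releases at most maxReleasesPerCheck leases, then unlocks. */
  int checkLeases() {
    int released = 0;
    lock.lock();
    try {
      while (released < maxReleasesPerCheck && !expiredLeases.isEmpty()) {
        expiredLeases.poll(); // stand-in for the real release work
        released++;
      }
    } finally {
      lock.unlock();
    }
    return released;
  }
}
```

With a limit of 1000 and 2500 expired leases queued, three successive checks would release 1000, 1000, and 500 leases, with the lock freed between checks.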



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
