[ 
https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235511#comment-15235511
 ] 

Vinayakumar B commented on HDFS-10220:
--------------------------------------

Thanks @Nicolas Fraison for the patch.
Looks almost good. Some comments below.

1. "DFS_NAMENODE_MAX_FILES_CHECKED_PER_ITERATION_KEY" could be 
"DFS_NAMENODE_LEASE_MAX_FILES_CHECKED_PER_ITERATION_KEY", same for DEFAULT as 
well.
2. {code}
  /**
   * Maximum number of files whose lease will be released in one
   * iteration of LeaseManager.checkLeases()
   */
  private final int maxFilesLeasesCheckedForRelease;
{code}
The comment could be corrected to say 'number of files checked to release 
leases in one iteration'.
3. {code}
  if (filesLeasesChecked >= fsnamesystem.getMaxFilesLeasesCheckedForRelease()) {
    removeFilesInLease(leaseToCheck, removing);
    LOG.warn("Breaking out of checkLeases() after " + filesLeasesChecked
        + " file leases checked.");
    break;
  }
{code}
Here, {{removeFilesInLease(leaseToCheck, removing);}} may not be required, as 
simply breaking out of the for-loop will let the removal happen outside the 
while-loop. So extracting it into a separate method may not be necessary 
either.
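A minimal, self-contained sketch of that point: once the cap is hit, a plain {{break}} is enough, because the removal that already runs after the loop covers both exit paths. All names and types here are illustrative, not the actual LeaseManager code:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified, hypothetical sketch of the control flow discussed in comment 3:
 * break out of the inner loop once the per-iteration cap is reached and let
 * the removal that already follows the loop do the cleanup, instead of
 * duplicating it before the break.
 */
public class CheckLeasesSketch {
  static int checkLeases(List<Long> inodeIds, int maxChecked, List<Long> removing) {
    int checked = 0;
    for (Long id : inodeIds) {
      if (checked >= maxChecked) {
        // No explicit removeFilesInLease(...) call needed here:
        break; // the post-loop removal below runs either way
      }
      removing.add(id); // stand-in for releasing this file's lease
      checked++;
    }
    // Removal happens here for both the normal exit and the early break.
    inodeIds.removeAll(removing);
    return checked;
  }

  public static void main(String[] args) {
    List<Long> ids = new ArrayList<>(List.of(1L, 2L, 3L, 4L, 5L));
    List<Long> removing = new ArrayList<>();
    int checked = checkLeases(ids, 3, removing);
    // With a cap of 3, only 3 ids are checked and removed; 2 remain.
    System.out.println(checked + " checked, " + ids.size() + " remaining");
  }
}
```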

4. {code}
-      Long[] leaseINodeIds = files.toArray(new Long[files.size()]);
+      final int numLeasePaths = files.size();
+      Long[] leaseINodeIds = files.toArray(new Long[numLeasePaths]);
{code}
This also may not be required.

5. checkstyle comments can be addressed.

+1, once these are addressed.

> Namenode failover due to too long locking in LeaseManager.Monitor
> ----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, 
> HADOOP-10220.003.patch, threaddump_zkfc.txt
>
>
> I have faced a namenode failover due to an unresponsive namenode detected by 
> the zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All 
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc there are lots of threads blocked due to 
> a lock.
> Looking at the code, there is a lock taken by the LeaseManager.Monitor when 
> some leases must be released. Due to the really big number of leases to be 
> released, the namenode took too much time to release them, blocking all 
> other tasks and making the zkfc think that the namenode was not 
> available/stuck.
> The idea of this patch is to limit the number of leases released each time we 
> check for leases, so the lock won't be held for too long a period.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
