[
https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218097#comment-15218097
]
Ravi Prakash commented on HDFS-10220:
-------------------------------------
{code}import static org.apache.hadoop.hdfs.DFSConfigKeys.*;{code}
Could you please import classes explicitly?
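For example (assuming the two constants referenced later in the patch are all that's needed):
{code}
import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_KEY;
import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_DEFAULT;
{code}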
{code}
/** Number max of path for released lease each time Monitor check for expired
lease */
private final long maxPathRealeaseExpiredLease;
{code}
has grammar and spelling errors. I'd suggest
{code}
/** Maximum number of files whose lease will be released in one iteration of
checkLeases() */
private final long maxPathReleaseExpiredLease; // <-- Release was misspelt here
{code}
{code}
Configuration conf = new Configuration();
this.maxPathRealeaseExpiredLease =
    conf.getLong(DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_KEY,
        DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_DEFAULT);
{code}
I'm fine with not getting {{maxPathRealeaseExpiredLease}} from configuration
and hardcoding it to your default value of 100000. If you want to keep the
configuration, I'd suggest changing
{{dfs.namenode.max-path-release-expired-lease}} to
{{dfs.namenode.lease-manager.max-released-leases-per-iteration}}.
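Roughly, in {{DFSConfigKeys}} that would be (the constant names are only my suggestion, they don't exist yet):
{code}
// Suggested additions to DFSConfigKeys; names are illustrative only.
public static final String DFS_NAMENODE_LEASE_MANAGER_MAX_RELEASED_LEASES_PER_ITERATION_KEY =
    "dfs.namenode.lease-manager.max-released-leases-per-iteration";
public static final long DFS_NAMENODE_LEASE_MANAGER_MAX_RELEASED_LEASES_PER_ITERATION_DEFAULT = 100000L;
{code}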
Please rename {{nPathReleaseExpiredLease}} to {{numLeasesReleased}}.
{code}// Stop releasing lease as a lock is hold after few iterations{code}
Please change it to {code}// Relinquish FSNamesystem lock after maxPathRealeaseExpiredLease iterations{code}
{code} LOG.warn("Breaking out of checkLeases() after " +
nPathReleaseExpiredLease + " iterations",
new Throwable("Too long loop with a lock"));
{code}
It's unnecessary to log an exception.
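Just so we're on the same page, here's a rough self-contained sketch of the pattern I'm describing -- the names and structure are illustrative, not the actual LeaseManager code:
{code}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative sketch only. The point: cap the work done under a single
 * lock acquisition so the Monitor never holds the FSNamesystem lock long
 * enough to stall the zkfc health checks.
 */
class LeaseReleaseSketch {
  private static final long MAX_LEASES_PER_ITERATION = 100000;
  // Stand-in for the FSNamesystem write lock.
  private final ReentrantLock fsLock = new ReentrantLock();
  private final Queue<String> expiredLeases = new ArrayDeque<>();

  /** One Monitor pass; returns true if it stopped early and should run again. */
  boolean checkLeases() {
    long numLeasesReleased = 0;
    fsLock.lock();
    try {
      while (!expiredLeases.isEmpty()) {
        if (numLeasesReleased++ >= MAX_LEASES_PER_ITERATION) {
          // Relinquish the lock; the next Monitor cycle picks up the rest.
          return true;
        }
        releaseLease(expiredLeases.poll());
      }
      return false;
    } finally {
      fsLock.unlock();
    }
  }

  private void releaseLease(String path) {
    // In the real code: close the file and remove the lease.
  }
}
{code}
The key point is that {{checkLeases()}} returns quickly even with millions of expired leases, so lock waiters make progress between Monitor passes.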
For testing purposes, you can add a method which changes
{{maxPathRealeaseExpiredLease}} and annotate it {{@VisibleForTesting}} (see the
sketch below). Could you please also rename {{testCheckLeaseNotInfiniteLoop}}
and change its documentation.
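For the test hook, something along these lines (assuming Guava's {{@VisibleForTesting}} annotation, which Hadoop already uses; the class and field names are illustrative):
{code}
import com.google.common.annotations.VisibleForTesting;

class LeaseManagerSketch { // stand-in for LeaseManager
  private long maxLeasesPerIteration = 100000;

  @VisibleForTesting
  void setMaxLeasesPerIteration(long max) {
    this.maxLeasesPerIteration = max;
  }
}
{code}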
> Namenode failover due to too long locking in LeaseManager.Monitor
> ----------------------------------------------------------------
>
> Key: HDFS-10220
> URL: https://issues.apache.org/jira/browse/HDFS-10220
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Nicolas Fraison
> Priority: Minor
> Attachments: HADOOP-10220.001.patch, threaddump_zkfc.txt
>
>
> I have faced a namenode failover due to an unresponsive namenode detected by
> the zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All
> existing blocks are COMPLETE, lease removed, file closed._
> In the threaddump taken by the zkfc there are lots of threads blocked on a
> lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when some
> leases must be released. Due to the really big number of leases to be
> released, the namenode took too long to release them, blocking all other
> tasks and making the zkfc think that the namenode was unavailable/stuck.
> The idea of this patch is to limit the number of leases released each time
> we check for expired leases, so the lock won't be held for too long.