[
https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218097#comment-15218097
]
Ravi Prakash commented on HDFS-10220:
-------------------------------------
{code}import static org.apache.hadoop.hdfs.DFSConfigKeys.*;{code}
Could you please import classes explicitly?
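For example (assuming the two constants referenced later in the patch are all that's needed):
{code}
import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_KEY;
import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_DEFAULT;
{code}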
{code}
/** Number max of path for released lease each time Monitor check for expired
lease */
private final long maxPathRealeaseExpiredLease;
{code}
has grammar and spelling errors. I'd suggest
{code}
/** Maximum number of files whose lease will be released in one iteration of
checkLeases() */
private final long maxPathReleaseExpiredLease; // <-- Release was misspelt here
{code}
{code}
Configuration conf = new Configuration();
this.maxPathRealeaseExpiredLease =
    conf.getLong(DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_KEY,
        DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_DEFAULT);
{code}
I'm fine with not getting {{maxPathRealeaseExpiredLease}} from configuration
and hardcoding it to your default value of 100000. If you want to keep the
configuration, I'd suggest changing
{{dfs.namenode.max-path-release-expired-lease}} to
{{dfs.namenode.lease-manager.max-released-leases-per-iteration}}.
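Roughly, in {{DFSConfigKeys}} that would be (the constant names are only my suggestion, they don't exist yet):
{code}
// Suggested additions to DFSConfigKeys; names are illustrative only.
public static final String DFS_NAMENODE_LEASE_MANAGER_MAX_RELEASED_LEASES_PER_ITERATION_KEY =
    "dfs.namenode.lease-manager.max-released-leases-per-iteration";
public static final long DFS_NAMENODE_LEASE_MANAGER_MAX_RELEASED_LEASES_PER_ITERATION_DEFAULT = 100000L;
{code}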
Please rename {{nPathReleaseExpiredLease}} to {{numLeasesReleased}}.
{code}// Stop releasing lease as a lock is hold after few iterations{code}
Please change it to {code}// Relinquish FSNamesystem lock after maxPathRealeaseExpiredLease iterations{code}
{code} LOG.warn("Breaking out of checkLeases() after " +
nPathReleaseExpiredLease + " iterations",
new Throwable("Too long loop with a lock"));
{code}
It's unnecessary to log an exception.
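Just so we're on the same page, here's a rough self-contained sketch of the pattern I'm describing -- the names and structure are illustrative, not the actual LeaseManager code:
{code}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative sketch only. The point: cap the work done under a single
 * lock acquisition so the Monitor never holds the FSNamesystem lock long
 * enough to stall the zkfc health checks.
 */
class LeaseReleaseSketch {
  private static final long MAX_LEASES_PER_ITERATION = 100000;
  // Stand-in for the FSNamesystem write lock.
  private final ReentrantLock fsLock = new ReentrantLock();
  private final Queue<String> expiredLeases = new ArrayDeque<>();

  /** One Monitor pass; returns true if it stopped early and should run again. */
  boolean checkLeases() {
    long numLeasesReleased = 0;
    fsLock.lock();
    try {
      while (!expiredLeases.isEmpty()) {
        if (numLeasesReleased++ >= MAX_LEASES_PER_ITERATION) {
          // Relinquish the lock; the next Monitor cycle picks up the rest.
          return true;
        }
        releaseLease(expiredLeases.poll());
      }
      return false;
    } finally {
      fsLock.unlock();
    }
  }

  private void releaseLease(String path) {
    // In the real code: close the file and remove the lease.
  }
}
{code}
The key point is that {{checkLeases()}} returns quickly even with millions of expired leases, so lock waiters make progress between Monitor passes.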
For testing purposes, you can add a method which changes
{{maxPathRealeaseExpiredLease}} and annotate it {{@VisibleForTesting}} (see the
sketch below). Could you please also rename {{testCheckLeaseNotInfiniteLoop}}
and change its documentation.
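For the test hook, something along these lines (assuming Guava's {{@VisibleForTesting}} annotation, which Hadoop already uses; the class and field names are illustrative):
{code}
import com.google.common.annotations.VisibleForTesting;

class LeaseManagerSketch { // stand-in for LeaseManager
  private long maxLeasesPerIteration = 100000;

  @VisibleForTesting
  void setMaxLeasesPerIteration(long max) {
    this.maxLeasesPerIteration = max;
  }
}
{code}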
> Namenode failover due to too long locking in LeaseManager.Monitor
> ----------------------------------------------------------------
>
> Key: HDFS-10220
> URL: https://issues.apache.org/jira/browse/HDFS-10220
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Nicolas Fraison
> Priority: Minor
> Attachments: HADOOP-10220.001.patch, threaddump_zkfc.txt
>
>
> I have faced a namenode failover due to an unresponsive namenode detected by
> the zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All
> existing blocks are COMPLETE, lease removed, file closed._
> In the threaddump taken by the zkfc there are lots of threads blocked on a
> lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when some
> leases must be released. Due to the really big number of leases to be
> released, the namenode took too long to release them, blocking all other
> tasks and making the zkfc think that the namenode was unavailable/stuck.
> The idea of this patch is to limit the number of leases released each time
> we check for expired leases, so the lock won't be held for too long.