[jira] [Commented] (HDFS-10220) Namenode failover due to too long locking in LeaseManager.Monitor

Nicolas Fraison (JIRA) Sun, 01 May 2016 06:37:23 -0700

    [ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265770#comment-15265770 ]


Nicolas Fraison commented on HDFS-10220:
----------------------------------------

[~raviprak] indeed, changing from *Lease leaseToCheck = sortedLeases.poll();* 
to *Lease leaseToCheck = sortedLeases.peek();* seems better.
But I don't think we need to add the lease to *removing* in the 
*completed* case, as it has already been removed by 
*fsnamesystem.internalReleaseLease* in *finalizeINodeFileUnderConstruction*.
I also think that we should keep the *removing.add(id);* in the exception 
case, as these exceptions seem to be raised (and the lease removed) for good reasons:
{code}
      // Only the last and the penultimate blocks may be in non COMPLETE state.
      // If the penultimate block is not COMPLETE, then it must be COMMITTED.
      .... 
      throw new IOException(message);
{code}
and
{code}
      // Cannot close file right now, since some blocks 
      // are not yet minimally replicated.
      // This may potentially cause infinite loop in lease recovery
      // if there are no valid replicas on data-nodes.
      ...
      throw new AlreadyBeingCreatedException(message);
{code}
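
To make the behaviour discussed above concrete, here is a minimal, self-contained sketch. It is not the patch itself: *LeaseCheckSketch*, *Lease*, *Releaser*, *newQueue*, *checkLeases* and *maxPerPass* are hypothetical stand-ins, and the real LeaseManager works on paths/inode ids under the namesystem lock. It only illustrates three points: peek instead of poll, no extra removal in the completed case, and removal when an exception is raised, with a cap on the work done per pass.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class LeaseCheckSketch {

    /** Hypothetical stand-in for a lease: an id and the time of its last renewal. */
    static final class Lease {
        final long id;
        final long lastRenewedMs;

        Lease(long id, long lastRenewedMs) {
            this.id = id;
            this.lastRenewedMs = lastRenewedMs;
        }
    }

    /**
     * Hypothetical stand-in for the release call. Assumption: when it returns
     * true (file closed), it has already removed the lease from the queue itself.
     */
    interface Releaser {
        boolean release(PriorityQueue<Lease> sortedLeases, Lease lease) throws IOException;
    }

    /** Queue ordered so that the least recently renewed lease is at the head. */
    static PriorityQueue<Lease> newQueue() {
        return new PriorityQueue<>(Comparator.comparingLong((Lease l) -> l.lastRenewedMs));
    }

    /**
     * Checks at most maxPerPass expired leases, so the caller's lock is held
     * for a bounded amount of work. Returns the ids removed because the
     * release attempt raised an exception.
     */
    static List<Long> checkLeases(PriorityQueue<Lease> sortedLeases, long nowMs,
                                  long hardLimitMs, int maxPerPass, Releaser releaser) {
        List<Long> removing = new ArrayList<>();
        for (int checked = 0; checked < maxPerPass; checked++) {
            // peek(), not poll(): the lease stays queued until it is really released.
            Lease leaseToCheck = sortedLeases.peek();
            if (leaseToCheck == null || nowMs - leaseToCheck.lastRenewedMs <= hardLimitMs) {
                break; // queue is empty, or the oldest lease has not expired yet
            }
            try {
                // Completed case: the releaser removed the lease, nothing more to do here.
                releaser.release(sortedLeases, leaseToCheck);
            } catch (IOException e) {
                // Exception case: drop the lease so it is not retried on every pass.
                sortedLeases.remove(leaseToCheck);
                removing.add(leaseToCheck.id);
            }
        }
        return removing;
    }
}
```

With five expired leases and *maxPerPass* set to 3, a single call handles three leases and leaves the other two for the next pass, which is the bounded-work behaviour the patch is aiming for.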

What is your feeling about this?

> Namenode failover due to too long locking in LeaseManager.Monitor
> -----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, 
> HADOOP-10220.003.patch, HADOOP-10220.004.patch, HADOOP-10220.005.patch, 
> threaddump_zkfc.txt
>
>
> I have faced a namenode failover due to an unresponsive namenode detected by the 
> zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All 
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc there are lots of threads blocked on a 
> lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when 
> some leases must be released. Due to the very large number of leases to be 
> released, the namenode took too long to release them, blocking all 
> other tasks and making the zkfc think that the namenode was not 
> available/stuck.
> The idea of this patch is to limit the number of leases released each time we 
> check for leases, so the lock won't be held for too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
