[jira] [Commented] (HDFS-10763) Open files can leak permanently due to inconsistent lease update

Kihwal Lee (JIRA) Thu, 18 Aug 2016 14:30:05 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427188#comment-15427188
 ]


Kihwal Lee commented on HDFS-10763:
-----------------------------------

The test passes reliably when run on my box.
{noformat}
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was 
removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 194.942 sec
 - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots

Results :

Tests run: 36, Failures: 0, Errors: 0, Skipped: 0
{noformat}

It failed in precommit due to a JVM OOM. From the log, it appears that the JVM's 
max heap size is smaller there.
{noformat}
INFO  util.GSet (LightWeightGSet.java:computeCapacity(356)) - 1.0% max memory 
918.5 MB = 9.2 MB
{noformat}
This is from my own test run:
{noformat}
INFO  util.GSet (LightWeightGSet.java:computeCapacity(356)) - 1.0% max memory 
3.6 GB = 36.4 MB
{noformat}
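For comparing environments, the figure in these log lines can be approximated outside Hadoop. The following is only a sketch: {{HeapCheck}} and {{percentOfMaxMemoryMb}} are hypothetical names, and it assumes the logged value is simply a percentage of what {{Runtime.getRuntime().maxMemory()}} reports.

```java
/** Hypothetical helper, not part of Hadoop: reports a percentage of the JVM max heap in MB. */
final class HeapCheck {
    /** Returns the given percentage of maxMemoryBytes, expressed in MB. */
    static double percentOfMaxMemoryMb(long maxMemoryBytes, double percent) {
        return maxMemoryBytes * (percent / 100.0) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // maxMemory() is the most memory this JVM will attempt to use for the heap.
        long max = Runtime.getRuntime().maxMemory();
        System.out.printf("max memory = %.1f MB, 1.0%% = %.1f MB%n",
                max / (1024.0 * 1024.0), percentOfMaxMemoryMb(max, 1.0));
    }
}
```

Running this in the same container as the precommit tests would show whether the JVM there really sees a smaller heap.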

We have this in {{hadoop-project/pom.xml}} and have verified that the forked test 
JVMs are running with {{-Xmx4096m}}.
{code:xml}
<maven-surefire-plugin.argLine>-Xmx4096m -XX:MaxPermSize=768m 
-XX:+HeapDumpOnOutOfMemoryError</maven-surefire-plugin.argLine>
{code}
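One way to confirm which heap flag a forked JVM actually received is to print its input arguments from inside the process. This is a sketch only, not necessarily how it was checked here; {{JvmArgsCheck}} and {{lastMaxHeapFlag}} are hypothetical names, and the code assumes the last {{-Xmx}} on the command line is the one that takes effect, which is typical for HotSpot.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper, not part of Hadoop: finds the -Xmx flag passed to a JVM. */
final class JvmArgsCheck {
    /** Returns the last argument starting with -Xmx, or null if there is none. */
    static String lastMaxHeapFlag(List<String> jvmArgs) {
        String found = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xmx")) {
                found = arg; // keep scanning so a later flag overrides an earlier one
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, excluding the arguments to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM args: " + jvmArgs);
        System.out.println("Max heap flag: " + lastMaxHeapFlag(jvmArgs));
    }
}
```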
I am guessing that the docker container had a lower memory limit. It looks like 
trunk tests are getting more memory.
{noformat}
INFO  util.GSet (LightWeightGSet.java:computeCapacity(397)) - 1.0% max memory 
1.8 GB = 18.2 MB
{noformat}

> Open files can leak permanently due to inconsistent lease update
> ----------------------------------------------------------------
>
>                 Key: HDFS-10763
>                 URL: https://issues.apache.org/jira/browse/HDFS-10763
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.3, 2.6.4
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 2.7.4, 3.0.0-alpha2
>
>         Attachments: HDFS-10763.br27.patch, 
> HDFS-10763.branch-2.7.supplement.patch, HDFS-10763.branch-2.7.v2.patch, 
> HDFS-10763.patch
>
>
> This can happen during {{commitBlockSynchronization()}} or when a client gives up 
> on closing a file after retries.
> From {{finalizeINodeFileUnderConstruction()}}, the lease is removed first and 
> then the inode is turned into the closed state. But if any block is not in 
> COMPLETE state, 
> {{INodeFile#assertAllBlocksComplete()}} will throw an exception. As a result, 
> the lease is removed from the lease manager, but not from the inode. 
> Since the lease manager does not have a lease for the file, no lease recovery 
> will happen for this file. Moreover, this broken state is persisted and 
> reconstructed through saving and loading of the fsimage. Since no replication is 
> scheduled for the blocks of the file, this can cause data loss and can also 
> block decommissioning of datanodes.
> The lease cannot be manually recovered either. It fails with
> {noformat}
> ...AlreadyBeingCreatedException): Failed to RECOVER_LEASE /xyz/xyz for user1 
> on
>  0.0.0.1 because the file is under construction but no leases found.
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2950)
> ...
> {noformat}
> When a client retries {{close()}}, the same inconsistent state is created, 
> but it can succeed the next time, since {{checkLease()}} only looks at the 
> inode, not the lease manager in this case. The close behavior is different if 
> HDFS-8999 is activated by setting 
> {{dfs.namenode.file.close.num-committed-allowed}} to 1 (unlikely) or 2 
> (never). 
> In principle, the under-construction feature of an inode and the lease in the 
> lease manager should never go out of sync. The fix involves two parts.
> 1) Prevent inconsistent lease updates. We can achieve this by calling 
> {{removeLease()}} after checking the block state. 
> 2) Avoid reconstructing inconsistent lease states from an fsimage. 1) alone 
> does not correct the existing inconsistencies that survive through fsimages.  
> This can be done at fsimage loading time by making sure a corresponding 
> lease exists for each inode that has the under-construction feature. 
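
The two parts of the fix described in the issue can be illustrated with a toy model. This is a minimal sketch, not the actual patch: {{LeaseSketch}}, {{ToyInode}}, {{finalizeFile}} and {{findInodesWithoutLease}} are hypothetical names that only mimic the ordering problem and are not NameNode classes. Part 2 here only detects inodes with a missing lease; what the real loader does with them is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; these types are hypothetical and are not the HDFS NameNode classes. */
final class LeaseSketch {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static final class ToyInode {
        final String path;
        final List<BlockState> blocks = new ArrayList<>();
        boolean underConstruction = true;

        ToyInode(String path) { this.path = path; }
    }

    /** Stand-in for the lease manager: file path -> lease holder. */
    final Map<String, String> leases = new HashMap<>();

    /** Part 1: check the block states first, and only then remove the lease and close. */
    void finalizeFile(ToyInode inode) {
        for (BlockState state : inode.blocks) {
            if (state != BlockState.COMPLETE) {
                // Thrown before any state is changed, so lease and inode stay in sync.
                throw new IllegalStateException("block not COMPLETE in " + inode.path);
            }
        }
        leases.remove(inode.path);
        inode.underConstruction = false;
    }

    /** Part 2: after loading, report under-construction inodes that have no lease. */
    List<String> findInodesWithoutLease(List<ToyInode> loaded) {
        List<String> missing = new ArrayList<>();
        for (ToyInode inode : loaded) {
            if (inode.underConstruction && !leases.containsKey(inode.path)) {
                missing.add(inode.path);
            }
        }
        return missing;
    }
}
```

With the check placed before the removal, a failed close leaves both the lease and the under-construction state in place, so a later retry or lease recovery still has something to act on.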



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

