[
https://issues.apache.org/jira/browse/HDFS-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427188#comment-15427188
]
Kihwal Lee commented on HDFS-10763:
-----------------------------------
The test passes reliably when run on my box.
{noformat}
-------------------------------------------------------
T E S T S
-------------------------------------------------------
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 194.942 sec - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Results :
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0
{noformat}
It failed in precommit due to a JVM OOM. From the log, it appears that the JVM's
max heap size there is smaller.
{noformat}
INFO util.GSet (LightWeightGSet.java:computeCapacity(356)) - 1.0% max memory 918.5 MB = 9.2 MB
{noformat}
This is from my own test run:
{noformat}
INFO util.GSet (LightWeightGSet.java:computeCapacity(356)) - 1.0% max memory 3.6 GB = 36.4 MB
{noformat}
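The figures in these log lines are simply 1% of the JVM's maximum heap, so they act as a quick indicator of how much heap each test JVM actually received. The minimal sketch below reproduces that arithmetic. It assumes the logged "max memory" is {{Runtime.getRuntime().maxMemory()}}; the class and helper names ({{HeapCheck}}, {{percentOf}}, {{mb}}) are hypothetical and only for illustration.

```java
import java.util.Locale;

// Minimal sketch: reproduce the "1.0% max memory X = Y" arithmetic from the
// LightWeightGSet log line. Assumption: the logged "max memory" is
// Runtime.getRuntime().maxMemory(), i.e. the heap ceiling this JVM received.
public class HeapCheck {

    static final double MB = 1024.0 * 1024.0;

    /** Returns the given percentage of a byte count. */
    static double percentOf(long bytes, double percent) {
        return bytes * percent / 100.0;
    }

    /** Formats a byte count as megabytes with one decimal place (locale-independent). */
    static String mb(double bytes) {
        return String.format(Locale.ROOT, "%.1f MB", bytes / MB);
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        // Same shape as the log line: max heap, then the 1% budget derived from it.
        System.out.println("1.0% max memory " + mb(max) + " = " + mb(percentOf(max, 1.0)));
    }
}
```

For the precommit run, 1% of 918.5 MB is 9.185 MB, which the log rounds to 9.2 MB. This sketch always prints MB, whereas the real log switches to GB for larger values.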
We have this in {{hadoop-project/pom.xml}}, and I verified that the forked test JVMs
are running with {{-Xmx4096m}}.
{code:xml}
<maven-surefire-plugin.argLine>-Xmx4096m -XX:MaxPermSize=768m -XX:+HeapDumpOnOutOfMemoryError</maven-surefire-plugin.argLine>
{code}
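One way to confirm that the {{argLine}} actually reached a forked JVM is to print the JVM's input arguments next to its effective max heap. The following is a small sketch using the standard {{java.lang.management}} API. The class name {{ArgLineCheck}} and the helper {{findXmx}} are hypothetical. It assumes HotSpot's usual behavior of applying the last occurrence of a repeated {{-Xmx}}.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Sketch: check which -Xmx a JVM was launched with and compare it with the
// heap ceiling the JVM reports. Names here are illustrative, not Hadoop APIs.
public class ArgLineCheck {

    /** Returns the value of the last -Xmx argument (e.g. "4096m"), or null if none is present. */
    static String findXmx(List<String> jvmArgs) {
        String xmx = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xmx")) {
                // Assumption: HotSpot applies the last occurrence of a repeated -Xmx flag.
                xmx = arg.substring("-Xmx".length());
            }
        }
        return xmx;
    }

    public static void main(String[] args) {
        // JVM options only; arguments passed to main() are not included.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM input arguments: " + jvmArgs);
        System.out.println("-Xmx: " + findXmx(jvmArgs));
        System.out.println("maxMemory (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

If {{-Xmx4096m}} is listed but {{maxMemory}} is far below 4 GB, or if no {{-Xmx}} shows up at all, the test JVM is not getting the configured heap.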
I am guessing that the Docker container had a lower memory limit. Trunk tests
appear to be getting more memory:
{noformat}
INFO util.GSet (LightWeightGSet.java:computeCapacity(397)) - 1.0% max memory 1.8 GB = 18.2 MB
{noformat}
> Open files can leak permanently due to inconsistent lease update
> ----------------------------------------------------------------
>
> Key: HDFS-10763
> URL: https://issues.apache.org/jira/browse/HDFS-10763
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.3, 2.6.4
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
> Fix For: 2.7.4, 3.0.0-alpha2
>
> Attachments: HDFS-10763.br27.patch,
> HDFS-10763.branch-2.7.supplement.patch, HDFS-10763.branch-2.7.v2.patch,
> HDFS-10763.patch
>
>
> This can happen during {{commitBlockSynchronization()}} or when a client gives up
> on closing a file after retries.
> In {{finalizeINodeFileUnderConstruction()}}, the lease is removed first and
> then the inode is turned into the closed state. But if any block is not in
> COMPLETE state, {{INodeFile#assertAllBlocksComplete()}} will throw an exception.
> This causes the lease to be removed from the lease manager but not from the inode.
> Since the lease manager does not have a lease for the file, no lease recovery
> will happen for this file. Moreover, this broken state is persisted and
> reconstructed through saving and loading of the fsimage. Since no replication is
> scheduled for the blocks of the file, this can cause data loss and also
> block decommissioning of datanodes.
> The lease cannot be manually recovered either. It fails with
> {noformat}
> ...AlreadyBeingCreatedException): Failed to RECOVER_LEASE /xyz/xyz for user1 on 0.0.0.1 because the file is under construction but no leases found.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2950)
> ...
> {noformat}
> When a client retries {{close()}}, the same inconsistent state is created,
> but a later retry can succeed because, in this case, {{checkLease()}} only looks
> at the inode, not the lease manager. The close behavior is different if
> HDFS-8999 is activated by setting
> {{dfs.namenode.file.close.num-committed-allowed}} to 1 (unlikely) or 2
> (never).
> In principle, the under-construction feature of an inode and the lease in the
> lease manager should never go out of sync. The fix involves two parts.
> 1) Prevent inconsistent lease updates. We can achieve this by calling
> {{removeLease()}} after checking the block state.
> 2) Avoid reconstructing inconsistent lease states from an fsimage. 1) alone
> does not correct existing inconsistencies that survive through fsimages.
> This can be done at fsimage loading time by making sure a corresponding
> lease exists for each inode that has the under-construction feature.
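The ordering bug and both parts of the fix can be illustrated with a toy model. This is a minimal sketch, not the actual NameNode code. {{LeaseConsistencyModel}}, {{FileInode}}, {{finalizeBuggy}}, {{finalizeFixed}} and {{reconcileAfterLoad}} are all hypothetical names. Leases are keyed by path only, whereas real leases belong to a client holder. Fsimage loading is reduced to a single reconciliation pass over the loaded inodes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the inode/lease invariant. All names are hypothetical and do not
// correspond to the real FSNamesystem, INodeFile or LeaseManager APIs.
public class LeaseConsistencyModel {

    /** Hypothetical inode: an under-construction flag plus per-block COMPLETE states. */
    static final class FileInode {
        final String path;
        final List<Boolean> blocksComplete;
        boolean underConstruction = true;

        FileInode(String path, List<Boolean> blocksComplete) {
            this.path = path;
            this.blocksComplete = blocksComplete;
        }

        /** Plays the role of assertAllBlocksComplete(): throws if any block is not COMPLETE. */
        void assertAllBlocksComplete() {
            if (blocksComplete.contains(Boolean.FALSE)) {
                throw new IllegalStateException("not all blocks COMPLETE: " + path);
            }
        }
    }

    /** Hypothetical lease manager state: paths that currently hold a lease. */
    final Set<String> leases = new HashSet<>();

    /** Buggy order: the lease is removed before the block check, which may throw. */
    void finalizeBuggy(FileInode f) {
        leases.remove(f.path);
        f.assertAllBlocksComplete(); // on failure: lease gone, inode still under construction
        f.underConstruction = false;
    }

    /** Part 1 of the fix: check the block state first, then remove the lease. */
    void finalizeFixed(FileInode f) {
        f.assertAllBlocksComplete(); // on failure: nothing has been modified yet
        leases.remove(f.path);
        f.underConstruction = false;
    }

    /** Part 2 of the fix: after loading, make sure every under-construction inode has a lease. */
    int reconcileAfterLoad(Map<String, FileInode> inodes) {
        int added = 0;
        for (FileInode f : inodes.values()) {
            if (f.underConstruction && leases.add(f.path)) {
                added++;
            }
        }
        return added;
    }

    /** Invariant: a file holds a lease if and only if it is under construction. */
    boolean consistent(Map<String, FileInode> inodes) {
        for (FileInode f : inodes.values()) {
            if (f.underConstruction != leases.contains(f.path)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FileInode f = new FileInode("/xyz/xyz", Arrays.asList(true, false));
        Map<String, FileInode> inodes = new HashMap<>();
        inodes.put(f.path, f);

        LeaseConsistencyModel buggy = new LeaseConsistencyModel();
        buggy.leases.add(f.path);
        try {
            buggy.finalizeBuggy(f);
        } catch (IllegalStateException expected) {
            // the close attempt fails because one block is not COMPLETE
        }
        System.out.println("buggy order consistent: " + buggy.consistent(inodes));
        System.out.println("leases added on load: " + buggy.reconcileAfterLoad(inodes));
        System.out.println("after reconcile consistent: " + buggy.consistent(inodes));

        LeaseConsistencyModel fixed = new LeaseConsistencyModel();
        fixed.leases.add(f.path);
        try {
            fixed.finalizeFixed(f);
        } catch (IllegalStateException expected) {
            // same failure, but the lease is still in place for recovery
        }
        System.out.println("fixed order consistent: " + fixed.consistent(inodes));
    }
}
```

Running {{main}} prints {{false}}, {{1}}, {{true}} and {{true}} in turn. The buggy ordering breaks the invariant. The load-time pass restores it by adding the missing lease. With the fixed ordering, a failed close leaves the lease in place. Adding a lease, rather than dropping the under-construction state, follows the approach stated in the description, so normal lease recovery can still finish the file.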