[
https://issues.apache.org/jira/browse/HDFS-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sangjin Lee updated HDFS-10763:
-------------------------------
Fix Version/s: 2.6.5
Cherry-picked it to 2.6.5 (trivial).
> Open files can leak permanently due to inconsistent lease update
> ----------------------------------------------------------------
>
> Key: HDFS-10763
> URL: https://issues.apache.org/jira/browse/HDFS-10763
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.3, 2.6.4
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
> Fix For: 2.6.5, 2.7.4, 3.0.0-alpha2
>
> Attachments: HDFS-10763.br27.patch,
> HDFS-10763.branch-2.7.supplement.patch, HDFS-10763.branch-2.7.v2.patch,
> HDFS-10763.patch
>
>
> This can happen during {{commitBlockSynchronization()}} or when a client gives up
> on closing a file after retries.
> From {{finalizeINodeFileUnderConstruction()}}, the lease is removed first and
> then the inode is turned into the closed state. But if any block is not in
> COMPLETE state,
> {{INodeFile#assertAllBlocksComplete()}} will throw an exception. As a result,
> the lease is removed from the lease manager, but not from the inode.
> Since the lease manager does not have a lease for the file, no lease recovery
> will happen for this file. Moreover, this broken state is persisted and
> reconstructed through saving and loading of fsimage. Since no replication is
> scheduled for the blocks of the file, this can cause data loss and also
> block the decommissioning of datanodes.
> The lease cannot be manually recovered either. It fails with
> {noformat}
> ...AlreadyBeingCreatedException): Failed to RECOVER_LEASE /xyz/xyz for user1 on 0.0.0.1 because the file is under construction but no leases found.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2950)
> ...
> {noformat}
> When a client retries {{close()}}, the same inconsistent state is created,
> but the next attempt can succeed, since {{checkLease()}} only looks at the
> inode, not the lease manager, in this case. The close behavior is different if
> HDFS-8999 is activated by setting
> {{dfs.namenode.file.close.num-committed-allowed}} to 1 (unlikely) or 2
> (never).
> In principle, the under-construction feature of an inode and the lease in the
> lease manager should never go out of sync. The fix involves two parts.
> 1) Prevent inconsistent lease updates. We can achieve this by calling
> {{removeLease()}} after checking the block state.
> 2) Avoid reconstructing inconsistent lease states from an fsimage. Part 1) alone
> does not correct existing inconsistencies that survive through fsimages.
> This can be done at fsimage loading time by making sure a corresponding
> lease exists for each inode that has the under-construction feature.
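
The ordering problem and the two-part fix described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual patch: {{LeaseSketch}}, {{ToyFile}}, {{finalizeLeaseFirst()}}, {{finalizeCheckFirst()}} and {{restoreMissingLeases()}} are hypothetical names, and none of this is Hadoop code. It also assumes that "making sure a lease exists" at load time means re-adding the missing lease; the real change may handle that case differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: none of these classes or methods are Hadoop's. */
class LeaseSketch {
    /** Hypothetical stand-in for an inode with the under-construction feature. */
    static final class ToyFile {
        final String path;
        final boolean allBlocksComplete;
        boolean underConstruction = true;

        ToyFile(String path, boolean allBlocksComplete) {
            this.path = path;
            this.allBlocksComplete = allBlocksComplete;
        }
    }

    /** Hypothetical stand-in for the lease manager: paths that hold a lease. */
    final Set<String> leases = new HashSet<>();

    /** Mirrors the role of assertAllBlocksComplete(): throws if a block is not COMPLETE. */
    static void assertAllBlocksComplete(ToyFile f) {
        if (!f.allBlocksComplete) {
            throw new IllegalStateException("block not COMPLETE: " + f.path);
        }
    }

    /** Ordering from the report: the lease is dropped before the check can throw. */
    void finalizeLeaseFirst(ToyFile f) {
        leases.remove(f.path);
        assertAllBlocksComplete(f); // on failure: no lease, file still under construction
        f.underConstruction = false;
    }

    /** Fix part 1 (sketch): check the block state first, then remove the lease. */
    void finalizeCheckFirst(ToyFile f) {
        assertAllBlocksComplete(f); // on failure: nothing has been changed yet
        leases.remove(f.path);
        f.underConstruction = false;
    }

    /** Invariant from the report: under-construction state and lease stay in sync. */
    boolean isConsistent(ToyFile f) {
        return f.underConstruction == leases.contains(f.path);
    }

    /** Fix part 2 (sketch): after loading, give every under-construction file a lease. */
    List<String> restoreMissingLeases(List<ToyFile> loaded) {
        List<String> repaired = new ArrayList<>();
        for (ToyFile f : loaded) {
            // Set.add returns true only when the path had no lease yet.
            if (f.underConstruction && leases.add(f.path)) {
                repaired.add(f.path);
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        LeaseSketch s = new LeaseSketch();
        ToyFile f = new ToyFile("/xyz/xyz", false); // one block is not COMPLETE
        s.leases.add(f.path);
        try {
            s.finalizeLeaseFirst(f);
        } catch (IllegalStateException e) {
            System.out.println("close failed, consistent=" + s.isConsistent(f));
        }
        System.out.println("repaired on load: " + s.restoreMissingLeases(Arrays.asList(f)));
    }
}
```

In this model, a failed {{finalizeLeaseFirst()}} leaves the file under construction with no lease, which is the state the report says lease recovery cannot fix. {{finalizeCheckFirst()}} fails without side effects, and {{restoreMissingLeases()}} repairs files that were already saved in the broken state.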