[
https://issues.apache.org/jira/browse/HDFS-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Íñigo Goiri updated HDFS-15209:
-------------------------------
Target Version/s: 3.1.3, 3.1.2 (was: 3.1.2, 3.1.3)
Resolution: Duplicate
Status: Resolved (was: Patch Available)
> Lease recovery: namenode not able to commitBlockSynchronization if client
> comes back and closes the file beforehand
> -------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-15209
> URL: https://issues.apache.org/jira/browse/HDFS-15209
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.2, 3.1.3
> Reporter: Ye Ni
> Assignee: Ye Ni
> Priority: Major
> Attachments: HDFS-15209.000.patch, HDFS-15209.001.patch
>
>
> We observed a case where a client closes the file after soft lease recovery
> has already started but before the namenode runs commitBlockSynchronization.
> This causes commitBlockSynchronization to fail with the error below, because
> it requires that either the file isn't closed or the last block isn't in the
> complete state.
> As a result, we end up with corrupted replicas due to a genstamp mismatch,
> since the datanodes may have already finished block recovery with a new
> genstamp.
> This can happen when a client stalls for a long time during a write and comes
> back after lease recovery has already been triggered by a
> write/append/truncate request from another client.
> Repro steps:
> # Client #1 finishes writing a file, but hasn't closed yet.
> # Client #1 doesn't renew its lease for a soft lease period.
> # Another client, #2, appends to the same file.
> # Soft lease recovery begins.
> # Block recovery in datanodes finishes.
> # Client #1 comes back to close the file.
> # Close succeeds since Client #1 still holds the lease (the lease isn't
> revoked until close in soft recovery).
> # Namenode tries to commitBlockSynchronization and fails with the error log
> below.
> # Namenode and datanodes have different genstamps for this file, resulting in
> a corrupted block.
> Fix:
> Check the state of the last block when completing the file. If it's under
> recovery, it means lease recovery has started, but the namenode hasn't run
> commitBlockSynchronization yet.
> In this case, don't complete the file.
>
> {code:java}
> 2020-02-22 22:47:04,698 INFO [IPC Server handler 32 on 8020]
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> commitBlockSynchronization(oldBlock=BP-269461681-10.65.230.22-1554624547020:blk_2642650669_3063725879,
> newgenerationstamp=3063765480, newlength=262144000,
> newtargets=[25.65.180.47:10010, 25.65.161.162:10010, 100.101.88.162:10010],
> closeFile=true, deleteBlock=false)
> 2020-02-22 22:47:04,698 DEBUG [IPC Server handler 32 on 8020]
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unexpected block
> (=BP-269461681-10.65.230.22-1554624547020:blk_2642650669_3063725879) since
> the file
> (=132269111992796228.data.637180347427616457.tmp.132269136349107823.copying)
> is not under construction
> {code}
>
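The race in the repro steps and the proposed check can be sketched with a toy model. This is only an illustration, not the attached patch or the real FSNamesystem code: the class LeaseRecoverySketch, its FileModel type, the BlockState enum, and every method and parameter name are hypothetical, and the state handling is heavily simplified. The two genstamp values are the ones that appear in the log above.

```java
// Toy model of the race described above; NOT the actual FSNamesystem code.
// All class, method, and field names here are hypothetical.
final class LeaseRecoverySketch {

    /** Simplified last-block states (an assumption made for this sketch). */
    enum BlockState { UNDER_CONSTRUCTION, UNDER_RECOVERY, COMPLETE }

    /** Minimal stand-in for the namenode's view of one file. */
    static final class FileModel {
        BlockState lastBlock = BlockState.UNDER_CONSTRUCTION;
        boolean closed = false;
        long genStamp;

        FileModel(long genStamp) {
            this.genStamp = genStamp;
        }
    }

    /** Repro steps 3-4: another client's append triggers soft lease recovery. */
    static void startLeaseRecovery(FileModel f) {
        f.lastBlock = BlockState.UNDER_RECOVERY;
    }

    /**
     * Repro step 6: the original client asks to close the file.
     * With checkRecovery=true (the proposed fix), the close is refused
     * while the last block is still under recovery.
     */
    static boolean completeFile(FileModel f, boolean checkRecovery) {
        if (checkRecovery && f.lastBlock == BlockState.UNDER_RECOVERY) {
            return false; // lease recovery is in flight; do not complete
        }
        f.lastBlock = BlockState.COMPLETE;
        f.closed = true;
        return true;
    }

    /**
     * Repro step 8: the namenode records the datanodes' recovery result.
     * Rejected when the file is already closed and its last block is complete,
     * mirroring the "not under construction" message in the log.
     */
    static boolean commitBlockSynchronization(FileModel f, long newGenStamp) {
        if (f.closed && f.lastBlock == BlockState.COMPLETE) {
            return false;
        }
        f.genStamp = newGenStamp;
        f.lastBlock = BlockState.COMPLETE;
        f.closed = true;
        return true;
    }

    /** Runs the scenario once; true if the namenode ends on the new genstamp. */
    static boolean genStampsAgree(boolean checkRecovery) {
        long oldGenStamp = 3063725879L; // genstamp in oldBlock from the log
        long newGenStamp = 3063765480L; // newgenerationstamp from the log
        FileModel f = new FileModel(oldGenStamp);
        startLeaseRecovery(f);
        completeFile(f, checkRecovery);
        commitBlockSynchronization(f, newGenStamp);
        return f.genStamp == newGenStamp;
    }

    public static void main(String[] args) {
        System.out.println("without check, genstamps agree: " + genStampsAgree(false));
        System.out.println("with check, genstamps agree: " + genStampsAgree(true));
    }
}
```

In this model, when the check is disabled the late close wins and the namenode keeps the old genstamp while the datanodes have moved to the new one. With the check enabled, the close is refused and commitBlockSynchronization can go through.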
--
This message was sent by Atlassian Jira
(v8.3.4#803005)