[ 
https://issues.apache.org/jira/browse/HDFS-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-15209:
-------------------------------
    Target Version/s: 3.1.3, 3.1.2  (was: 3.1.2, 3.1.3)
          Resolution: Duplicate
              Status: Resolved  (was: Patch Available)

> Lease recovery: namenode not able to commitBlockSynchronization if client 
> comes back and closes the file beforehand
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15209
>                 URL: https://issues.apache.org/jira/browse/HDFS-15209
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.2, 3.1.3
>            Reporter: Ye Ni
>            Assignee: Ye Ni
>            Priority: Major
>         Attachments: HDFS-15209.000.patch, HDFS-15209.001.patch
>
>
> We observed a case where a client closes the file after soft lease recovery 
> has already started but before the namenode runs commitBlockSynchronization.
> This causes commitBlockSynchronization to fail with the error below, because 
> it requires that either the file isn't closed or the last block isn't in the 
> complete state.
> As a result, we end up with replicas marked corrupt due to a genstamp 
> mismatch, since datanodes may have finished block recovery with a new genstamp.
> This can happen when a client stalls for a long time on a write and comes 
> back after lease recovery has already been triggered by a 
> write/append/truncate request from another client.
> Repro steps:
>  # Client #1 finishes writing a file, but hasn't closed it yet.
>  # Client #1 doesn't renew its lease for a soft lease period.
>  # Another client #2 appends to the same file.
>  # Soft lease recovery begins.
>  # Block recovery in datanodes finishes.
>  # Client #1 comes back to close the file.
>  # Close succeeds since Client #1 still holds the lease (the lease isn't 
> revoked until close in soft recovery).
>  # Namenode tries to commitBlockSynchronization and fails with the error log 
> below.
>  # Namenode and datanodes have different genstamp for this file, resulting in 
> corrupted block.
> Fix:
> Check the state of the last block when completing the file. If it's under 
> recovery, it means lease recovery has started, but the namenode hasn't run 
> commitBlockSynchronization yet.
> In this case, don't complete the file.
>  
> {code:java}
> 2020-02-22 22:47:04,698 INFO [IPC Server handler 32 on 8020] 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
> commitBlockSynchronization(oldBlock=BP-269461681-10.65.230.22-1554624547020:blk_2642650669_3063725879,
>  newgenerationstamp=3063765480, newlength=262144000, 
> newtargets=[25.65.180.47:10010, 25.65.161.162:10010, 100.101.88.162:10010], 
> closeFile=true, deleteBlock=false)
> 2020-02-22 22:47:04,698 DEBUG [IPC Server handler 32 on 8020] 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unexpected block 
> (=BP-269461681-10.65.230.22-1554624547020:blk_2642650669_3063725879) since 
> the file 
> (=132269111992796228.data.637180347427616457.tmp.132269136349107823.copying) 
> is not under construction
> {code}
>  
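
The race and the proposed check can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class, enum, and method names below (LeaseRecoverySketch, ToyFile, BlockState, runScenario) are hypothetical and are not the NameNode code or the attached patches. The model keeps a single last block, uses illustrative genstamp values, and encodes only the precondition quoted in the report (commitBlockSynchronization is rejected when the file is closed and the last block is complete).

```java
// Toy model only: these classes are hypothetical and are NOT the HDFS NameNode code.
class LeaseRecoverySketch {

    /** Simplified last-block states used by this sketch. */
    enum BlockState { UNDER_CONSTRUCTION, UNDER_RECOVERY, COMPLETE }

    /** A file with a single (last) block, as the toy namenode sees it. */
    static class ToyFile {
        BlockState lastBlockState = BlockState.UNDER_CONSTRUCTION;
        long genStamp;          // genstamp recorded by the toy namenode
        boolean closed = false;

        ToyFile(long genStamp) { this.genStamp = genStamp; }

        /** Soft lease recovery starts: the last block goes under recovery. */
        void startLeaseRecovery() {
            lastBlockState = BlockState.UNDER_RECOVERY;
        }

        /**
         * Client close. With checkRecovery=true (the behaviour proposed in the
         * report), the close is refused while the last block is under recovery.
         */
        boolean completeFile(boolean checkRecovery) {
            if (checkRecovery && lastBlockState == BlockState.UNDER_RECOVERY) {
                return false; // recovery started, commitBlockSynchronization pending
            }
            lastBlockState = BlockState.COMPLETE;
            closed = true;
            return true;
        }

        /**
         * Datanodes report the recovered block. Mirrors the precondition in the
         * report: rejected if the file is closed and the last block is complete.
         */
        boolean commitBlockSynchronization(long newGenStamp) {
            if (closed && lastBlockState == BlockState.COMPLETE) {
                return false; // corresponds to "file is not under construction"
            }
            genStamp = newGenStamp;
            lastBlockState = BlockState.COMPLETE;
            closed = true;
            return true;
        }
    }

    /** Runs the repro steps; returns true if namenode and datanode genstamps agree. */
    static boolean runScenario(boolean checkRecovery) {
        long oldGenStamp = 1L, newGenStamp = 2L;      // illustrative values
        ToyFile file = new ToyFile(oldGenStamp);
        file.startLeaseRecovery();                    // steps 3-4
        long datanodeGenStamp = newGenStamp;          // step 5: block recovery done
        file.completeFile(checkRecovery);             // steps 6-7
        file.commitBlockSynchronization(newGenStamp); // step 8
        return file.genStamp == datanodeGenStamp;     // step 9
    }

    public static void main(String[] args) {
        System.out.println("without check, genstamps match: " + runScenario(false));
        System.out.println("with check, genstamps match:    " + runScenario(true));
    }
}
```

In this model, running main prints false for the scenario without the check (the close wins the race and the namenode keeps the old genstamp) and true with the check (the close is refused, so commitBlockSynchronization can record the new genstamp). How the real client should react to a refused close is outside the scope of this sketch.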



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

