[jira] [Commented] (HDFS-15725) Lease Recovery never completes for a committed block which the DNs never finalize

Stephen O'Donnell (Jira) Fri, 11 Dec 2020 15:16:07 -0800


    [ https://issues.apache.org/jira/browse/HDFS-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248242#comment-17248242 ]


Stephen O'Donnell commented on HDFS-15725:
------------------------------------------

I have committed this to trunk and branch-3.3.

Branch-3.2 had two very minor conflicts in the tests. This call, which appears 
twice in the tests:

{code}
    HdfsFileStatus stat = client.getNamenode()
        .create(file, new FsPermission("777"), client.clientName,
            new EnumSetWritable<CreateFlag>(EnumSet.of(CreateFlag.CREATE)),
            true, (short) repFactor, 1024 * 1024 * 128L,
            new CryptoProtocolVersion[0], null, null);
{code}

On branch-3.2, it has to have the final parameter removed:

{code}
    HdfsFileStatus stat = client.getNamenode()
        .create(file, new FsPermission("777"), client.clientName,
            new EnumSetWritable<CreateFlag>(EnumSet.of(CreateFlag.CREATE)),
            true, (short) repFactor, 1024 * 1024 * 128L,
            new CryptoProtocolVersion[0], null);
{code}

The failing tests pass locally. The branch-3.2 patch also applies cleanly to 
branch-3.1, so I think we need just one more patch, for 2.10.

> Lease Recovery never completes for a committed block which the DNs never 
> finalize
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-15725
>                 URL: https://issues.apache.org/jira/browse/HDFS-15725
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-15725.001.patch, HDFS-15725.002.patch, 
> HDFS-15725.003.patch, HDFS-15725.branch-3.2.001.patch, 
> lease_recovery_2_10.patch
>
>
> In a very rare condition, the HDFS client process can get killed right at the 
> time it is completing a block / file.
> The client sends the "complete" call to the namenode, moving the block into a 
> committed state, but it dies before it can send the final packet to the 
> Datanodes telling them to finalize the block.
> This means the blocks are stuck on the datanodes in RBW state and nothing 
> will ever tell them to move out of that state.
> The namenode / lease manager will retry forever to close the file, but it 
> will always complain it is waiting for blocks to reach minimal replication.
> I have a simple test and patch to fix this, but I think it warrants some 
> discussion on whether this is the correct thing to do, or if I need to put 
> the fix behind a config switch.
> My idea is that if lease recovery occurs and the block is still waiting on 
> "minimal replication", just put the file back to UNDER_CONSTRUCTION so that 
> on the next lease recovery attempt, BLOCK RECOVERY will happen, close the 
> file and move the replicas to FINALIZED.
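
The idea in the description above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual patch or NameNode code: the class `LeaseRecoverySketch`, the `ToyFile` type, the `revertCommitted` flag, and the `MIN_REPLICATION` constant are all hypothetical names invented for illustration. It models a file whose last block is COMMITTED while every replica is still RBW, and compares lease recovery with and without the proposed revert to UNDER_CONSTRUCTION.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: none of these types exist in Hadoop.
public class LeaseRecoverySketch {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }
    enum ReplicaState { RBW, FINALIZED }

    // Hypothetical stand-in for "minimal replication".
    static final int MIN_REPLICATION = 1;

    static class ToyFile {
        BlockState lastBlock = BlockState.COMMITTED;
        final List<ReplicaState> replicas = new ArrayList<>();
        boolean closed = false;

        int finalizedCount() {
            int n = 0;
            for (ReplicaState r : replicas) {
                if (r == ReplicaState.FINALIZED) {
                    n++;
                }
            }
            return n;
        }
    }

    /** One lease recovery attempt; returns true once the file is closed. */
    static boolean recoverLease(ToyFile f, boolean revertCommitted) {
        switch (f.lastBlock) {
            case COMMITTED:
                if (f.finalizedCount() >= MIN_REPLICATION) {
                    f.lastBlock = BlockState.COMPLETE;
                    f.closed = true;
                    return true;
                }
                // Still waiting on minimal replication. With the proposed
                // idea, put the block back to UNDER_CONSTRUCTION so the
                // next attempt goes through block recovery.
                if (revertCommitted) {
                    f.lastBlock = BlockState.UNDER_CONSTRUCTION;
                }
                return false;
            case UNDER_CONSTRUCTION:
                // Simulated block recovery: replicas move to FINALIZED
                // and the file is closed.
                Collections.fill(f.replicas, ReplicaState.FINALIZED);
                f.lastBlock = BlockState.COMPLETE;
                f.closed = true;
                return true;
            default:
                f.closed = true;
                return true;
        }
    }

    /** Returns the attempt number that closed the file, or -1 if none did. */
    static int attemptsUntilClosed(boolean revertCommitted, int maxAttempts) {
        ToyFile f = new ToyFile();
        for (int i = 0; i < 3; i++) {
            f.replicas.add(ReplicaState.RBW); // the client died before finalizing
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (recoverLease(f, revertCommitted)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(attemptsUntilClosed(false, 10)); // prints -1
        System.out.println(attemptsUntilClosed(true, 10));  // prints 2
    }
}
```

In this model, without the revert every attempt finds a COMMITTED block with no finalized replicas and gives up, which mirrors the "retry forever" behaviour described above. With the revert, the first attempt resets the block state and the second attempt closes the file.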



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
