[
https://issues.apache.org/jira/browse/HDFS-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900808#comment-15900808
]
Masatake Iwasaki commented on HDFS-11499:
-----------------------------------------
A comment about {{testDecommissionWithOpenFileAndDatanodeFailing}}:
{noformat}
678 // Kill one of the datanodes of the last block
679 getCluster().stopDataNode(lastBlockLocations[0].getName());
{noformat}
I think this is misleading and makes the test time unnecessarily long. If my
understanding is correct, the issue is reproduced only if the nodes are in the
decommissioning state while the last block is being completed.
How about putting the nodes into decommissioning first and then invoking lease
recovery? For example:
{noformat}
// Decommission all nodes of the last block
ArrayList<String> toDecom = new ArrayList<>();
for (DatanodeInfo dnDecom : lastBlockLocations) {
  toDecom.add(dnDecom.getXferAddr());
}
initExcludeHosts(toDecom);
refreshNodes(0);

// Make sure hard lease expires
getCluster().setLeasePeriod(300L, 300L);
Thread.sleep(2 * BLOCKREPORT_INTERVAL_MSEC);

for (DatanodeInfo dnDecom : lastBlockLocations) {
  DatanodeInfo datanode = NameNodeAdapter.getDatanode(
      getCluster().getNamesystem(), dnDecom);
  waitNodeState(datanode, AdminStates.DECOMMISSIONED);
}
{noformat}
Stopping the datanode causes a connection failure to the dead node and a retry
of the replica recovery; that merely makes it highly probable, not guaranteed,
that the nodes are in the decommissioning state before the last block is
completed.
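To make that ordering deterministic rather than probabilistic, the test could
also wait explicitly until every location of the last block has entered the
decommissioning state before lease recovery kicks in. A minimal sketch of such
a wait, assuming it lives in the same test class; the helper name
{{waitForDecommissionInProgress}} is made up for illustration, while
{{NameNodeAdapter.getDatanode}} and {{DatanodeInfo#getAdminState}} are the
existing APIs:
{noformat}
import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo.AdminStates;
import org.apache.hadoop.hdfs.server.namenode.FSNamesystem;
import org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter;

/**
 * Poll the namesystem until every location of the last block is in
 * DECOMMISSION_INPROGRESS (or already DECOMMISSIONED), so that block
 * recovery is guaranteed to run against decommissioning replicas.
 * A real test would bound this with a timeout (e.g. GenericTestUtils.waitFor).
 */
private static void waitForDecommissionInProgress(FSNamesystem ns,
    DatanodeInfo[] lastBlockLocations)
    throws IOException, InterruptedException {
  while (true) {
    boolean allDecommissioning = true;
    for (DatanodeInfo loc : lastBlockLocations) {
      AdminStates state =
          NameNodeAdapter.getDatanode(ns, loc).getAdminState();
      if (state != AdminStates.DECOMMISSION_INPROGRESS
          && state != AdminStates.DECOMMISSIONED) {
        allDecommissioning = false;
        break;
      }
    }
    if (allDecommissioning) {
      return;
    }
    Thread.sleep(100);
  }
}
{noformat}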
> Decommissioning stuck because of failing recovery
> -------------------------------------------------
>
> Key: HDFS-11499
> URL: https://issues.apache.org/jira/browse/HDFS-11499
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, namenode
> Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha2
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Labels: blockmanagement, decommission, recovery
> Fix For: 3.0.0-alpha3
>
> Attachments: HDFS-11499.02.patch, HDFS-11499.03.patch,
> HDFS-11499.04.patch, HDFS-11499.patch
>
>
> Block recovery will fail to finalize the file if the locations of the last,
> incomplete block are being decommissioned. Vice versa, the decommissioning
> will be stuck, waiting for the last block to be completed.
> {code}
> org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException):
> Failed to finalize INodeFile testRecoveryFile since blocks[255] is
> non-complete, where blocks=[blk_1073741825_1001, blk_1073741826_1002...
> {code}
> The fix is to count replicas on decommissioning nodes when completing the
> last block in BlockManager.commitOrCompleteLastBlock, as we know that the
> DecommissionManager will not decommission a node that has UC blocks.
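For illustration only, the counting described above boils down to treating a
replica on a DECOMMISSION_INPROGRESS node as still usable when deciding whether
the last block can be completed. A rough sketch of that idea (the method name
{{hasEnoughReplicasToComplete}} and the {{minReplication}} parameter are
hypothetical; the actual change is inside
{{BlockManager.commitOrCompleteLastBlock}}):
{code:java}
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

/**
 * Hypothetical illustration of the fix: replicas on decommissioning
 * nodes still count toward completing the last block, because the
 * DecommissionManager will not move a node to DECOMMISSIONED while it
 * holds an under-construction block.
 */
static boolean hasEnoughReplicasToComplete(
    DatanodeInfo[] lastBlockLocations, int minReplication) {
  int usable = 0;
  for (DatanodeInfo dn : lastBlockLocations) {
    // Fully decommissioned nodes are excluded; live and
    // DECOMMISSION_INPROGRESS nodes are counted.
    if (!dn.isDecommissioned()) {
      usable++;
    }
  }
  return usable >= minReplication;
}
{code}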