[
https://issues.apache.org/jira/browse/HDFS-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900592#comment-15900592
]
Masatake Iwasaki commented on HDFS-11499:
-----------------------------------------
The timeout seems relevant, since replica recovery was not attempted after the
first 30 seconds in the failed test case.
{noformat}
$ grep 'initReplicaRecovery:' org.apache.hadoop.hdfs.TestDecommission-output.txt.failed
2017-03-07 14:13:35,095
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@7dcda518] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2382)) -
initReplicaRecovery: blk_1073741826_1002, recoveryId=1004,
replica=FinalizedReplica, blk_1073741826_1002, FINALIZED
2017-03-07 14:13:35,096
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@7dcda518] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2440)) -
initReplicaRecovery: changing replica state for blk_1073741826_1002 from
FINALIZED to RUR
...snip
2017-03-07 14:14:03,092
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5c8628b1] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2382)) -
initReplicaRecovery: blk_1073741826_1002, recoveryId=1018,
replica=ReplicaUnderRecovery, blk_1073741826_1002, RUR
2017-03-07 14:14:03,092
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5c8628b1] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2433)) -
initReplicaRecovery: update recovery id for blk_1073741826_1002 from 1017 to
1018
{noformat}
{noformat}
$ tail -n2 org.apache.hadoop.hdfs.TestDecommission-output.txt.failed
2017-03-07 14:19:26,875 [main] INFO impl.MetricsSystemImpl
(MetricsSystemImpl.java:shutdown(607)) - DataNode metrics system shutdown
complete.
2017-03-07 14:19:26,987 [Thread-11] INFO hdfs.AdminStatesBaseTest
(AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node
127.0.0.1:43314 to change state to Decommissioned current state: Decommission
In Progress
{noformat}
DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY was replaced by
DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, keeping the effective
default value, based on the description of HDFS-10219.
{noformat}
public static final String DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY =
    "dfs.namenode.reconstruction.pending.timeout-sec";
public static final int DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_DEFAULT = 300;
{noformat}
I am trying to set the timeout to 4 in {{AdminStatesBaseTest#setup}} to see
whether it has an effect.
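For reference, lowering the timeout in the test setup would look roughly like
this (a sketch only, not the committed patch; it assumes the test base class
holds its configuration in a {{Configuration}} field named {{conf}}, as is
common in the MiniDFSCluster-based tests):
{code:java}
// Sketch: shorten the pending-reconstruction timeout so that a block
// recovery that fails on the first attempt is retried within a few
// seconds instead of the 300-second default. The value 4 is the one
// tried in this comment.
conf.setInt(
    DFSConfigKeys.DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, 4);
{code}
With the default of 300 seconds, a recovery that fails once is not retried
before the test's own timeout expires, which would match the stuck
decommissioning observed in the log above.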
> Decommissioning stuck because of failing recovery
> -------------------------------------------------
>
> Key: HDFS-11499
> URL: https://issues.apache.org/jira/browse/HDFS-11499
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, namenode
> Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha2
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Labels: blockmanagement, decommission, recovery
> Fix For: 3.0.0-alpha3
>
> Attachments: HDFS-11499.02.patch, HDFS-11499.03.patch,
> HDFS-11499.04.patch, HDFS-11499.patch
>
>
> Block recovery will fail to finalize the file if the locations of the last,
> incomplete block are being decommissioned. Vice versa, the decommissioning
> will be stuck, waiting for the last block to be completed.
> {code:xml}
> org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException):
> Failed to finalize INodeFile testRecoveryFile since blocks[255] is
> non-complete, where blocks=[blk_1073741825_1001, blk_1073741826_1002...
> {code}
> The fix is to count replicas on decommissioning nodes when completing last
> block in BlockManager.commitOrCompleteLastBlock, as we know that the
> DecommissionManager will not decommission a node that has UC blocks.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)