[
https://issues.apache.org/jira/browse/HDFS-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900592#comment-15900592
]
Masatake Iwasaki commented on HDFS-11499:
-----------------------------------------
The timeout seems relevant, since replica recovery was not attempted after the
first 30 seconds in the failed test case.
{noformat}
$ grep 'initReplicaRecovery:' org.apache.hadoop.hdfs.TestDecommission-output.txt.failed
2017-03-07 14:13:35,095
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@7dcda518] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2382)) -
initReplicaRecovery: blk_1073741826_1002, recoveryId=1004,
replica=FinalizedReplica, blk_1073741826_1002, FINALIZED
2017-03-07 14:13:35,096
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@7dcda518] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2440)) -
initReplicaRecovery: changing replica state for blk_1073741826_1002 from
FINALIZED to RUR
...snip
2017-03-07 14:14:03,092
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5c8628b1] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2382)) -
initReplicaRecovery: blk_1073741826_1002, recoveryId=1018,
replica=ReplicaUnderRecovery, blk_1073741826_1002, RUR
2017-03-07 14:14:03,092
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5c8628b1] INFO
impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2433)) -
initReplicaRecovery: update recovery id for blk_1073741826_1002 from 1017 to
1018
{noformat}
{noformat}
$ tail -n2 org.apache.hadoop.hdfs.TestDecommission-output.txt.failed
2017-03-07 14:19:26,875 [main] INFO impl.MetricsSystemImpl
(MetricsSystemImpl.java:shutdown(607)) - DataNode metrics system shutdown
complete.
2017-03-07 14:19:26,987 [Thread-11] INFO hdfs.AdminStatesBaseTest
(AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node
127.0.0.1:43314 to change state to Decommissioned current state: Decommission
In Progress
{noformat}
DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY was replaced by
DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, keeping the effective
default value, based on the description of HDFS-10219.
{noformat}
public static final String DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY =
    "dfs.namenode.reconstruction.pending.timeout-sec";
public static final int DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_DEFAULT = 300;
{noformat}
I am trying to set the timeout to 4 in {{AdminStatesBaseTest#setup}} to see
whether it has an effect.
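For reference, lowering the timeout in the test setup would look roughly like
this (a sketch only, not the committed patch; it assumes the test base class
holds its configuration in a {{Configuration}} field named {{conf}}, as is
common in the MiniDFSCluster-based tests):
{code:java}
// Sketch: shorten the pending-reconstruction timeout so that a block
// recovery that fails on the first attempt is retried within a few
// seconds instead of the 300-second default. The value 4 is the one
// tried in this comment.
conf.setInt(
    DFSConfigKeys.DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, 4);
{code}
With the default of 300 seconds, a recovery that fails once is not retried
before the test's own timeout expires, which would match the stuck
decommissioning observed in the log above.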
> Decommissioning stuck because of failing recovery
> -------------------------------------------------
>
> Key: HDFS-11499
> URL: https://issues.apache.org/jira/browse/HDFS-11499
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, namenode
> Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha2
> Reporter: Lukas Majercak
> Assignee: Lukas Majercak
> Labels: blockmanagement, decommission, recovery
> Fix For: 3.0.0-alpha3
>
> Attachments: HDFS-11499.02.patch, HDFS-11499.03.patch,
> HDFS-11499.04.patch, HDFS-11499.patch
>
>
> Block recovery will fail to finalize the file if the locations of the last,
> incomplete block are being decommissioned. Vice versa, the decommissioning
> will be stuck, waiting for the last block to be completed.
> {code:xml}
> org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException):
> Failed to finalize INodeFile testRecoveryFile since blocks[255] is
> non-complete, where blocks=[blk_1073741825_1001, blk_1073741826_1002...
> {code}
> The fix is to count replicas on decommissioning nodes when completing last
> block in BlockManager.commitOrCompleteLastBlock, as we know that the
> DecommissionManager will not decommission a node that has UC blocks.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)