[
https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205432#comment-16205432
]
Yongjun Zhang edited comment on HDFS-12642 at 10/16/17 11:54 PM:
-----------------------------------------------------------------
Makes sense [~xiaochen].
I found the failed test was reported as HDFS-12279.
+1 on rev2.
was (Author: yzhangal):
Makes sense [~xiaochen].
I found the failed test was reported as HDFS-12279.
+2 on rev2.
> Log block and datanode details in BlockRecoveryWorker
> -----------------------------------------------------
>
> Key: HDFS-12642
> URL: https://issues.apache.org/jira/browse/HDFS-12642
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Fix For: 2.9.0, 2.8.3, 3.0.0
>
> Attachments: HDFS-12642.01.patch, HDFS-12642.02.patch
>
>
> In a recent investigation, we have seen a weird block recovery issue, which
> is difficult to reach to a conclusion because of insufficient logs.
> For the most critical part of the events, we see block recovery failed to
> {{commitBlockSynchronization]} on the NN, due to the block not closed. This
> leaves the file as open forever (for 1+ months).
> The reason the block was not closed on NN, was because it is configured with
> {{dfs.namenode.replication.min}} =2, and only 1 replica was with the latest
> genstamp.
> We were not able to tell why only 1 replica is on latest genstamp.
> From the primary node of the recovery (ps2204), {{initReplicaRecoveryImpl}}
> was called on each of the 7 DNs the block were ever placed. All DNs but
> ps2204 and ps3765 failed because of genstamp comparison - that's expected.
> ps2204 and ps3765 have gone past the comparison (since no exceptions from
> their logs), but {{updateReplicaUnderRecovery}} only appeared to be called on
> ps3765.
> This jira is to propose we log more details when {{BlockRecoveryWorker}} is
> about to call {{updateReplicaUnderRecovery}} on the DataNodes, so this could
> be figured out in the future.
> {noformat}
> $ grep "updateReplica:" ps2204.dn.log
> $ grep "updateReplica:" ps3765.dn.log
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.933Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"updateReplicaUnderRecovery","level":"INFO","line_number":"2512","thread_name":"IPC
> Server handler 6 on
> 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"updateReplica:
> BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312,
> recoveryId=1107074793, length=65024, replica=ReplicaUnderRecovery,
> blk_2172795728_1106150312, RUR
> $ grep "initReplicaRecovery:" ps2204.dn.log
> hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery:
> blk_2172795728_1106150312, recoveryId=1107074793,
> replica=ReplicaWaitingToBeRecovered, blk_2172795728_1106150312, RWR
> hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery:
> changing replica state for blk_2172795728_1106150312 from RWR to
> RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}
> $ grep "initReplicaRecovery:" ps3765.dn.log
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC
> Server handler 5 on
> 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery:
> blk_2172795728_1106150312, recoveryId=1107074793,
> replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC
> Server handler 5 on
> 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery:
> blk_2172795728_1106150312, recoveryId=1107074793,
> replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
> hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"IPC
> Server handler 5 on
> 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery:
> changing replica state for blk_2172795728_1106150312 from RBW to
> RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}
> {noformat}
> P.S. HDFS-11499 was once suspected, but non-conclusive since we don't have
> all NN logs to know about the decommission.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]