[ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204993#comment-16204993 ]
Xiao Chen commented on HDFS-12642: ---------------------------------- Thanks Yongjun for the review. Patch 2 addressed the comments. New example logs for convenience: {noformat} 2017-10-14 19:43:25,345 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(549)) - BlockRecoveryWorker: NameNode at localhost/127.0.0.1:52885 calls recoverBlock(BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003, targets=[DatanodeInfoWithStorage[127.0.0.1:52896,null,null], DatanodeInfoWithStorage[127.0.0.1:52901,null,null], DatanodeInfoWithStorage[127.0.0.1:52888,null,null]], newGenerationStamp=1004, newBlock=null, isStriped=false) 2017-10-14 19:43:25,368 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:syncBlock(200)) - BlockRecoveryWorker: block=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952), isTruncateRecovery=false, syncList=[block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52896,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52901,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52888,null,null]] 2017-10-14 19:43:25,369 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:syncBlock(291)) - BlockRecoveryWorker: block=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952), bestState=FINALIZED, newBlock=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1004 (length=952), participatingList=[block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52896,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52901,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52888,null,null]] {noformat} > Log block and datanode details in BlockRecoveryWorker > ----------------------------------------------------- > > Key: HDFS-12642 > URL: https://issues.apache.org/jira/browse/HDFS-12642 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode > Reporter: Xiao Chen > Assignee: Xiao Chen > Attachments: HDFS-12642.01.patch > > > In a recent investigation, we have seen a weird block recovery issue, which > is difficult to reach to a conclusion because of insufficient logs. > For the most critical part of the events, we see block recovery failed to > {{commitBlockSynchronization]} on the NN, due to the block not closed. This > leaves the file as open forever (for 1+ months). > The reason the block was not closed on NN, was because it is configured with > {{dfs.namenode.replication.min}} =2, and only 1 replica was with the latest > genstamp. > We were not able to tell why only 1 replica is on latest genstamp. > From the primary node of the recovery (ps2204), {{initReplicaRecoveryImpl}} > was called on each of the 7 DNs the block were ever placed. All DNs but > ps2204 and ps3765 failed because of genstamp comparison - that's expected. > ps2204 and ps3765 have gone past the comparison (since no exceptions from > their logs), but {{updateReplicaUnderRecovery}} only appeared to be called on > ps3765. > This jira is to propose we log more details when {{BlockRecoveryWorker}} is > about to call {{updateReplicaUnderRecovery}} on the DataNodes, so this could > be figured out in the future. > {noformat} > $ grep "updateReplica:" ps2204.dn.log > $ grep "updateReplica:" ps3765.dn.log > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.933Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"updateReplicaUnderRecovery","level":"INFO","line_number":"2512","thread_name":"IPC > Server handler 6 on > 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"updateReplica: > BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312, > recoveryId=1107074793, length=65024, replica=ReplicaUnderRecovery, > blk_2172795728_1106150312, RUR > $ grep "initReplicaRecovery:" ps2204.dn.log > hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: > blk_2172795728_1106150312, recoveryId=1107074793, > replica=ReplicaWaitingToBeRecovered, blk_2172795728_1106150312, RWR > hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: > changing replica state for blk_2172795728_1106150312 from RWR to > RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}} > $ grep "initReplicaRecovery:" ps3765.dn.log > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC > Server handler 5 on > 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: > blk_2172795728_1106150312, recoveryId=1107074793, > replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC > Server handler 5 on > 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: > blk_2172795728_1106150312, recoveryId=1107074793, > replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"IPC > Server handler 5 on > 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: > changing replica state for blk_2172795728_1106150312 from RBW to > RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}} > {noformat} > P.S. HDFS-11499 was once suspected, but non-conclusive since we don't have > all NN logs to know about the decommission. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org