[ https://issues.apache.org/jira/browse/HDFS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641221#comment-13641221 ]
Varun Sharma commented on HDFS-4721:
------------------------------------

Here are the remaining messages. Looking at them, there are messages 40 minutes later, when I bring back the dead datanode. I think it reports the block and a recovery is then performed, since it's still in the recovery queue.

2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942 on 10.168.12.138:50010 size 7039284 does not belong to any file
2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942 to 10.168.12.138:50010
2013-04-24 06:57:17,240 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.168.12.138:50010 to delete [blk_-121400693146753449_11986, blk_7815495529310756756_10715, blk_4125941153395778345_10713, blk_7979989947202390292_11938, blk_-2482251885029951704_11942, blk_-2834772731171489244_10711]
2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942 on 10.170.6.131:50010 size 7039284 does not belong to any file
2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942 to 10.170.6.131:50010
2013-04-24 09:14:26,916 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.170.6.131:50010 to delete [blk_-6242914570577158362_12305, blk_7396709163981662539_11419, blk_-121400693146753449_11986, blk_7815495529310756756_10716, blk_8175754220082115190_12303, blk_1204694577977643985_12307, blk_4125941153395778345_10718, blk_7979989947202390292_11938, blk_-2482251885029951704_11942, blk_-3317357101836432862_12390, blk_-5206526708499881023_11940, blk_-2834772731171489244_10717]
2013-04-24 16:38:26,254 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942, newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], closeFile=true, deleteBlock=false)
2013-04-24 16:38:26,255 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
2013-04-24 16:38:26,255 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.170.15.97:44875: error: java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* addBlock: block blk_-2482251885029951704_12012 on 10.170.15.97:50010 size 7044280 does not belong to any file
2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_12012 to 10.170.15.97:50010
2013-04-24 16:38:28,766 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.170.15.97:50010 to delete [blk_-121400693146753449_12233, blk_-2482251885029951704_12012, blk_7979989947202390292_11989]
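For context on what kicks off the recovery seen in these logs, below is a minimal client-side sketch, assuming a hypothetical WAL path and polling loop; only DistributedFileSystem#recoverLease is a real API, everything else is illustrative. This is roughly what HBase does before replaying a WAL: ask the NameNode to recover the lease and wait until the file can be closed.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * Hypothetical client-side sketch of what starts the recovery discussed in
 * this issue: ask the NameNode to recover the lease on the WAL and poll until
 * the file is closed. Only DistributedFileSystem#recoverLease is the real
 * API; the path, polling loop, and timing are illustrative assumptions.
 */
public class WalLeaseRecoverySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical WAL path for illustration only.
    Path wal = new Path("/hbase/.logs/example-regionserver/example-wal");

    FileSystem fs = wal.getFileSystem(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("lease recovery only applies to HDFS");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // recoverLease returns true once the lease is released and the file is
    // closed. While the last block sits in UNDER_RECOVERY waiting on a dead
    // primary datanode, it keeps returning false, which is where the delay
    // described in this issue shows up.
    while (!dfs.recoverLease(wal)) {
      Thread.sleep(1000L);
    }
    System.out.println("lease recovered for " + wal);
  }
}
{code}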
> Speed up lease/block recovery when DN fails and a block goes into recovery
> --------------------------------------------------------------------------
>
> Key: HDFS-4721
> URL: https://issues.apache.org/jira/browse/HDFS-4721
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.0.3-alpha
> Reporter: Varun Sharma
> Fix For: 2.0.4-alpha
>
> Attachments: 4721-hadoop2.patch, 4721-trunk.patch, 4721-trunk-v2.patch, 4721-v2.patch, 4721-v3.patch, 4721-v4.patch, 4721-v5.patch, 4721-v6.patch, 4721-v7.patch, 4721-v8.patch
>
> This was observed while doing HBase WAL recovery. HBase uses append to write to its write-ahead log, so initially the pipeline is set up as
> DN1 --> DN2 --> DN3
> This WAL needs to be read when DN1 fails, since DN1 also houses the HBase regionserver that wrote the WAL.
> HBase first recovers the lease on the WAL file. During recovery, we choose DN1 as the primary DN to do the recovery, even though DN1 has failed and is no longer heartbeating.
> Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There are two options (see the sketch after this description):
> a) Ride on HDFS-3703: if stale node detection is turned on, we do not choose stale datanodes (typically no heartbeat for 20-30 seconds) as primary DN(s).
> b) We sort the replicas in order of last heartbeat and always pick the one that gave the most recent heartbeat.
> Going to the dead datanode increases lease + block recovery time, since the block goes into the UNDER_RECOVERY state even though no one is actively recovering it.
> Please let me know if this makes sense and, if yes, whether we should move forward with a) or b).
> Thanks
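To make options a) and b) concrete, here is a minimal, self-contained sketch of the proposed selection policy. The names (ReplicaLocation, pickPrimaryForRecovery, staleIntervalMs) are hypothetical and this is not the code in the attached patches, which would apply inside the NameNode's recovery path.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the proposal: when choosing the primary datanode for
 * lease/block recovery, skip replicas whose datanode looks stale (option a)
 * and prefer the most recently heartbeated one (option b). All class and
 * method names here are hypothetical, not actual NameNode code.
 */
public class PrimaryDatanodeSelectionSketch {

  /** Minimal stand-in for a replica location plus its datanode's last heartbeat. */
  static final class ReplicaLocation {
    final String datanode;       // e.g. "10.170.6.131:50010"
    final long lastHeartbeatMs;  // last heartbeat time, epoch millis

    ReplicaLocation(String datanode, long lastHeartbeatMs) {
      this.datanode = datanode;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  /**
   * Returns the replica whose datanode heartbeated most recently, ignoring
   * datanodes with no heartbeat within staleIntervalMs. If every replica
   * looks stale, falls back to the least-stale one so recovery can still be
   * attempted.
   */
  static ReplicaLocation pickPrimaryForRecovery(
      List<ReplicaLocation> locations, long nowMs, long staleIntervalMs) {
    List<ReplicaLocation> candidates = new ArrayList<ReplicaLocation>();
    for (ReplicaLocation loc : locations) {
      if (nowMs - loc.lastHeartbeatMs <= staleIntervalMs) {
        candidates.add(loc);                 // option a: drop stale datanodes
      }
    }
    if (candidates.isEmpty()) {
      candidates = locations;                // all stale: fall back to everything
    }
    ReplicaLocation best = null;
    for (ReplicaLocation loc : candidates) { // option b: most recent heartbeat wins
      if (best == null || loc.lastHeartbeatMs > best.lastHeartbeatMs) {
        best = loc;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    List<ReplicaLocation> pipeline = Arrays.asList(
        new ReplicaLocation("DN1:50010", now - 600000L), // dead for ~10 minutes
        new ReplicaLocation("DN2:50010", now - 2000L),
        new ReplicaLocation("DN3:50010", now - 1000L));

    // With a 30 second staleness threshold, the dead DN1 is never picked.
    ReplicaLocation primary = pickPrimaryForRecovery(pipeline, now, 30000L);
    System.out.println("primary DN for recovery: " + primary.datanode);
  }
}
{code}

The staleness filter corresponds to option a) (HDFS-3703-style stale-node avoidance) and the max-by-heartbeat choice to option b); the fallback keeps recovery possible even if every replica looks stale.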