[ https://issues.apache.org/jira/browse/HDFS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641221#comment-13641221 ]
Varun Sharma commented on HDFS-4721:
------------------------------------

Here are the remaining messages. Looking at them, there are messages 40 minutes later, when I bring back the dead datanode. I think it reports the block and a recovery is then performed, since it's still in the recovery queue.

2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942 on 10.168.12.138:50010 size 7039284 does not belong to any file
2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942 to 10.168.12.138:50010
2013-04-24 06:57:17,240 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.168.12.138:50010 to delete [blk_-121400693146753449_11986, blk_7815495529310756756_10715, blk_4125941153395778345_10713, blk_7979989947202390292_11938, blk_-2482251885029951704_11942, blk_-2834772731171489244_10711]
2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942 on 10.170.6.131:50010 size 7039284 does not belong to any file
2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942 to 10.170.6.131:50010
2013-04-24 09:14:26,916 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.170.6.131:50010 to delete [blk_-6242914570577158362_12305, blk_7396709163981662539_11419, blk_-121400693146753449_11986, blk_7815495529310756756_10716, blk_8175754220082115190_12303, blk_1204694577977643985_12307, blk_4125941153395778345_10718, blk_7979989947202390292_11938, blk_-2482251885029951704_11942, blk_-3317357101836432862_12390, blk_-5206526708499881023_11940, blk_-2834772731171489244_10717]
2013-04-24 16:38:26,254 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942, newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], closeFile=true, deleteBlock=false)
2013-04-24 16:38:26,255 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
2013-04-24 16:38:26,255 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.170.15.97:44875: error: java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* addBlock: block blk_-2482251885029951704_12012 on 10.170.15.97:50010 size 7044280 does not belong to any file
2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_12012 to 10.170.15.97:50010
2013-04-24 16:38:28,766 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.170.15.97:50010 to delete [blk_-121400693146753449_12233, blk_-2482251885029951704_12012, blk_7979989947202390292_11989]
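For context on what kicks off the recovery seen in these logs, below is a minimal client-side sketch, assuming a hypothetical WAL path and polling loop; only DistributedFileSystem#recoverLease is a real API, everything else is illustrative. This is roughly what HBase does before replaying a WAL: ask the NameNode to recover the lease and wait until the file can be closed.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * Hypothetical client-side sketch of what starts the recovery discussed in
 * this issue: ask the NameNode to recover the lease on the WAL and poll until
 * the file is closed. Only DistributedFileSystem#recoverLease is the real
 * API; the path, polling loop, and timing are illustrative assumptions.
 */
public class WalLeaseRecoverySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical WAL path for illustration only.
    Path wal = new Path("/hbase/.logs/example-regionserver/example-wal");

    FileSystem fs = wal.getFileSystem(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("lease recovery only applies to HDFS");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // recoverLease returns true once the lease is released and the file is
    // closed. While the last block sits in UNDER_RECOVERY waiting on a dead
    // primary datanode, it keeps returning false, which is where the delay
    // described in this issue shows up.
    while (!dfs.recoverLease(wal)) {
      Thread.sleep(1000L);
    }
    System.out.println("lease recovered for " + wal);
  }
}
{code}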
> Speed up lease/block recovery when DN fails and a block goes into recovery
> --------------------------------------------------------------------------
>
> Key: HDFS-4721
> URL: https://issues.apache.org/jira/browse/HDFS-4721
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.0.3-alpha
> Reporter: Varun Sharma
> Fix For: 2.0.4-alpha
>
> Attachments: 4721-hadoop2.patch, 4721-trunk.patch, 4721-trunk-v2.patch, 4721-v2.patch, 4721-v3.patch, 4721-v4.patch, 4721-v5.patch, 4721-v6.patch, 4721-v7.patch, 4721-v8.patch
>
> This was observed while doing HBase WAL recovery. HBase uses append to write to its write-ahead log, so initially the pipeline is set up as
> DN1 --> DN2 --> DN3
> This WAL needs to be read when DN1 fails, since DN1 also houses the HBase regionserver that wrote the WAL.
> HBase first recovers the lease on the WAL file. During recovery, we choose DN1 as the primary DN to do the recovery, even though DN1 has failed and is no longer heartbeating.
> Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There are two options (see the sketch after this description):
> a) Ride on HDFS-3703: if stale node detection is turned on, we do not choose stale datanodes (typically no heartbeat for 20-30 seconds) as primary DN(s).
> b) We sort the replicas in order of last heartbeat and always pick the one that gave the most recent heartbeat.
> Going to the dead datanode increases lease + block recovery time, since the block goes into the UNDER_RECOVERY state even though no one is actively recovering it.
> Please let me know if this makes sense and, if yes, whether we should move forward with a) or b).
> Thanks
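To make options a) and b) concrete, here is a minimal, self-contained sketch of the proposed selection policy. The names (ReplicaLocation, pickPrimaryForRecovery, staleIntervalMs) are hypothetical and this is not the code in the attached patches, which would apply inside the NameNode's recovery path.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the proposal: when choosing the primary datanode for
 * lease/block recovery, skip replicas whose datanode looks stale (option a)
 * and prefer the most recently heartbeated one (option b). All class and
 * method names here are hypothetical, not actual NameNode code.
 */
public class PrimaryDatanodeSelectionSketch {

  /** Minimal stand-in for a replica location plus its datanode's last heartbeat. */
  static final class ReplicaLocation {
    final String datanode;       // e.g. "10.170.6.131:50010"
    final long lastHeartbeatMs;  // last heartbeat time, epoch millis

    ReplicaLocation(String datanode, long lastHeartbeatMs) {
      this.datanode = datanode;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  /**
   * Returns the replica whose datanode heartbeated most recently, ignoring
   * datanodes with no heartbeat within staleIntervalMs. If every replica
   * looks stale, falls back to the least-stale one so recovery can still be
   * attempted.
   */
  static ReplicaLocation pickPrimaryForRecovery(
      List<ReplicaLocation> locations, long nowMs, long staleIntervalMs) {
    List<ReplicaLocation> candidates = new ArrayList<ReplicaLocation>();
    for (ReplicaLocation loc : locations) {
      if (nowMs - loc.lastHeartbeatMs <= staleIntervalMs) {
        candidates.add(loc);                 // option a: drop stale datanodes
      }
    }
    if (candidates.isEmpty()) {
      candidates = locations;                // all stale: fall back to everything
    }
    ReplicaLocation best = null;
    for (ReplicaLocation loc : candidates) { // option b: most recent heartbeat wins
      if (best == null || loc.lastHeartbeatMs > best.lastHeartbeatMs) {
        best = loc;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    List<ReplicaLocation> pipeline = Arrays.asList(
        new ReplicaLocation("DN1:50010", now - 600000L), // dead for ~10 minutes
        new ReplicaLocation("DN2:50010", now - 2000L),
        new ReplicaLocation("DN3:50010", now - 1000L));

    // With a 30 second staleness threshold, the dead DN1 is never picked.
    ReplicaLocation primary = pickPrimaryForRecovery(pipeline, now, 30000L);
    System.out.println("primary DN for recovery: " + primary.datanode);
  }
}
{code}

The staleness filter corresponds to option a) (HDFS-3703-style stale-node avoidance) and the max-by-heartbeat choice to option b); the fallback keeps recovery possible even if every replica looks stale.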