[ https://issues.apache.org/jira/browse/HDFS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637698#comment-13637698 ]

Varun Sharma commented on HDFS-4721:
------------------------------------

I attached a rough patch which:
a) avoids choosing a stale node as the primary datanode to do the recovery
b) skips over stale nodes in the recovery locations passed to the primary 
datanode
A sketch of both checks follows.
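
To make the intent concrete, here is a minimal sketch of both checks in 
plain Java. DatanodeInfo, isStale and the stale interval below are 
simplified stand-ins for the real NameNode types; this is illustrative, 
not the actual patch:

import java.util.ArrayList;
import java.util.List;

class StaleAwareRecoverySketch {
  // Simplified stand-in for the real DatanodeInfo, for illustration only.
  static class DatanodeInfo {
    final String addr;
    final long lastHeartbeatMs; // wall-clock time of the last heartbeat
    DatanodeInfo(String addr, long lastHeartbeatMs) {
      this.addr = addr;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
    boolean isStale(long now, long staleIntervalMs) {
      return now - lastHeartbeatMs > staleIntervalMs;
    }
  }

  // a) Choose the primary DN for block recovery, skipping stale nodes.
  static DatanodeInfo choosePrimary(List<DatanodeInfo> replicas,
                                    long staleIntervalMs) {
    long now = System.currentTimeMillis();
    for (DatanodeInfo dn : replicas) {
      if (!dn.isStale(now, staleIntervalMs)) {
        return dn; // first non-stale replica becomes the primary
      }
    }
    // Fall back to the old behavior if every replica looks stale.
    return replicas.isEmpty() ? null : replicas.get(0);
  }

  // b) Drop stale nodes from the recovery locations sent to the primary.
  static List<DatanodeInfo> filterRecoveryLocations(
      List<DatanodeInfo> replicas, long staleIntervalMs) {
    long now = System.currentTimeMillis();
    List<DatanodeInfo> live = new ArrayList<DatanodeInfo>();
    for (DatanodeInfo dn : replicas) {
      if (!dn.isStale(now, staleIntervalMs)) {
        live.add(dn);
      }
    }
    // If filtering would leave nothing, keep the original list.
    return live.isEmpty() ? replicas : live;
  }
}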

Earlier, recovery took as long as dfs.socket.timeout; now it takes roughly 
1-2 seconds (which is basically the heartbeat interval). Here are the NN logs 
from a test where we suspend an HBase region server and the HDFS datanode; 
the block is finalized within 1 second (the relevant timeout knobs are 
summarized after the logs). The patch is rough and I am looking for comments.

2013-04-21 23:31:40,036 INFO BlockStateChange: BLOCK* 
blk_1083189771170117282_5999{blockUCState=UNDER_RECOVERY, primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW], 
ReplicaUnderConstruction[10.157.42.32:50010|RBW]]} skipping stale node for 
primary, node=10.170.15.97:50010
2013-04-21 23:31:40,036 INFO BlockStateChange: BLOCK* 
blk_1083189771170117282_5999{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW], 
ReplicaUnderConstruction[10.157.42.32:50010|RBW]]} recovery started, 
primary=10.170.6.131:50010
2013-04-21 23:31:40,036 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.internalReleaseLease: File 
/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366586774505-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366586774505.1366586775415
 has not been closed. Lease recovery is in progress. RecoveryId = 6148 for 
block blk_1083189771170117282_5999{blockUCState=UNDER_RECOVERY, 
primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW], 
ReplicaUnderConstruction[10.157.42.32:50010|RBW]]}
2013-04-21 23:31:41,280 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.170.6.131:50010 is added to 
blk_1083189771170117282_5999{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW], 
ReplicaUnderConstruction[10.157.42.32:50010|RBW]]} size 0
2013-04-21 23:31:41,282 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.157.42.32:50010 is added to 
blk_1083189771170117282_5999{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW], 
ReplicaUnderConstruction[10.157.42.32:50010|RBW]]} size 0
2013-04-21 23:31:41,282 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_1083189771170117282_5999,
 newgenerationstamp=6148, newlength=51174873, newtargets=[10.170.6.131:50010, 
10.157.42.32:50010], closeFile=true, deleteBlock=false)
2013-04-21 23:31:41,290 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
commitBlockSynchronization(newblock=BP-889095791-10.171.1.40-1366491606582:blk_1083189771170117282_5999,
 
file=/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366586774505-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366586774505.1366586775415,
 newgenerationstamp=6148, newlength=51174873, newtargets=[10.170.6.131:50010, 
10.157.42.32:50010]) successful
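
For reference, these are the timing knobs in play. The values shown are the 
usual 2.x defaults (double-check against your hdfs-default.xml); the snippet 
just shows where they live:

import org.apache.hadoop.conf.Configuration;

public class RecoveryTimeoutKnobs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Old path: recovery could stall for the full socket timeout (~60s).
    conf.setInt("dfs.socket.timeout", 60 * 1000);                     // ms
    // DNs heartbeat every few seconds, so liveness is visible quickly.
    conf.setLong("dfs.heartbeat.interval", 3);                        // seconds
    // HDFS-3703 staleness: mark a DN stale after ~30s without a heartbeat.
    conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000);  // ms
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
  }
}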

> Speed up lease/block recovery when DN fails and a block goes into recovery
> --------------------------------------------------------------------------
>
>                 Key: HDFS-4721
>                 URL: https://issues.apache.org/jira/browse/HDFS-4721
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.0.3-alpha
>            Reporter: Varun Sharma
>         Attachments: 4721-hadoop2.patch
>
>
> This was observed while doing HBase WAL recovery. HBase uses append to write 
> to its write-ahead log, so initially the pipeline is set up as
> DN1 --> DN2 --> DN3
> This WAL needs to be read when DN1 fails, since DN1 also houses the HBase 
> regionserver that owns the WAL.
> HBase first recovers the lease on the WAL file. During recovery, we choose 
> DN1 as the primary DN to do the recovery even though DN1 has failed and is 
> no longer heartbeating.
> Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There 
> are two options.
> a) Ride on HDFS-3703: if stale-node detection is turned on, do not choose 
> stale datanodes (typically no heartbeat for 20-30 seconds) as primary DN(s)
> b) Sort the replicas in order of last heartbeat and always pick the ones 
> with the most recent heartbeat (a sketch of this follows the quoted 
> description below)
> Going to the dead datanode increases lease and block recovery time, since 
> the block goes into the UNDER_RECOVERY state even though no one is actively 
> recovering it. Please let me know if this makes sense and, if yes, whether 
> we should move forward with a) or b).
> Thanks
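
For option b), a minimal sketch of the sort in plain Java (the Replica type 
and field names here are illustrative, not the real HDFS API):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class HeartbeatSortSketch {
  // Illustrative replica record; not the real HDFS type.
  static class Replica {
    final String addr;
    final long lastHeartbeatMs;
    Replica(String addr, long lastHeartbeatMs) {
      this.addr = addr;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  // Most recent heartbeat first, so index 0 becomes the primary DN.
  static void sortByFreshestHeartbeat(List<Replica> replicas) {
    Collections.sort(replicas, new Comparator<Replica>() {
      public int compare(Replica a, Replica b) {
        return Long.compare(b.lastHeartbeatMs, a.lastHeartbeatMs);
      }
    });
  }
}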

