[ https://issues.apache.org/jira/browse/HDFS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637867#comment-13637867 ]
Ted Yu commented on HDFS-4721:
------------------------------
In FSNamesystem#internalReleaseLease():
{code}
    case UNDER_CONSTRUCTION:
    case UNDER_RECOVERY:
      final BlockInfoUnderConstruction uc =
          (BlockInfoUnderConstruction)lastBlock;
      // setup the last block locations from the blockManager if not known
      if (uc.getNumExpectedLocations() == 0) {
        uc.setExpectedLocations(blockManager.getNodes(lastBlock));
      }
      // start recovery of the last block for this file
      long blockRecoveryId = nextGenerationStamp();
      lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
{code}
Both states currently share the same branch. Can we distinguish UNDER_RECOVERY
from UNDER_CONSTRUCTION so that the problem described in HBASE-8389 can be avoided?
> Speed up lease/block recovery when DN fails and a block goes into recovery
> --------------------------------------------------------------------------
>
> Key: HDFS-4721
> URL: https://issues.apache.org/jira/browse/HDFS-4721
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.0.3-alpha
> Reporter: Varun Sharma
> Fix For: 2.0.4-alpha
>
> Attachments: 4721-hadoop2.patch
>
>
> This was observed while doing HBase WAL recovery. HBase uses append to write
> to its write-ahead log, so initially the pipeline is set up as
> DN1 --> DN2 --> DN3
> This WAL needs to be read when DN1 fails, since DN1 is the node hosting the
> HBase regionserver for the WAL.
> HBase first recovers the lease on the WAL file. During recovery, we choose
> DN1 as the primary DN to do the recovery even though DN1 has failed and is
> no longer heartbeating.
> Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There
> are two options:
> a) Build on HDFS-3703: if stale node detection is turned on, we do not
> choose stale datanodes (typically those with no heartbeat for 20-30 seconds)
> as primary DN(s).
> b) We sort the replicas by last heartbeat and always pick the one with the
> most recent heartbeat.
> Going to the dead datanode increases lease + block recovery time, since the
> block goes into UNDER_RECOVERY state even though no one is actively
> recovering it.
> Please let me know if this makes sense and, if so, whether we should move
> forward with a) or b).
> Thanks
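
Options a) and b) from the description above could be sketched roughly as
follows. This is a minimal, self-contained illustration and not HDFS code: the
class PrimaryReplicaChooser, the Replica type, and the method names are all
hypothetical, and the 30-second stale interval is an assumption taken from the
"20-30 seconds" mentioned in the description. Falling back to the stale
replicas when every replica is stale is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names come from the HDFS code base. */
public class PrimaryReplicaChooser {

    /** Minimal stand-in for a datanode that holds a replica of the last block. */
    public static final class Replica {
        public final String name;
        public final long lastHeartbeatMillis;

        public Replica(String name, long lastHeartbeatMillis) {
            this.name = name;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
    }

    /** Option a): keep only replicas whose last heartbeat is within the stale interval. */
    public static List<Replica> nonStale(
            List<Replica> replicas, long nowMillis, long staleIntervalMillis) {
        List<Replica> fresh = new ArrayList<>();
        for (Replica r : replicas) {
            if (nowMillis - r.lastHeartbeatMillis <= staleIntervalMillis) {
                fresh.add(r);
            }
        }
        return fresh;
    }

    /** Option b): order replicas so the most recent heartbeat comes first. */
    public static List<Replica> mostRecentFirst(List<Replica> replicas) {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(
            Comparator.comparingLong((Replica r) -> r.lastHeartbeatMillis).reversed());
        return sorted;
    }

    /**
     * Combines both options: drop stale replicas, then pick the most recent one.
     * If every replica is stale, fall back to the most recent of all of them
     * (an assumption of this sketch).
     */
    public static Replica choosePrimary(
            List<Replica> replicas, long nowMillis, long staleIntervalMillis) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("no replicas to choose from");
        }
        List<Replica> fresh = nonStale(replicas, nowMillis, staleIntervalMillis);
        List<Replica> candidates = fresh.isEmpty() ? replicas : fresh;
        return mostRecentFirst(candidates).get(0);
    }

    public static void main(String[] args) {
        long now = 100_000L;
        List<Replica> pipeline = Arrays.asList(
            new Replica("DN1", now - 60_000L),  // failed, last heartbeat 60 s ago
            new Replica("DN2", now - 2_000L),
            new Replica("DN3", now - 1_000L));
        // Assumed stale interval of 30 s.
        System.out.println(choosePrimary(pipeline, now, 30_000L).name);  // prints "DN3"
    }
}
```

In this sketch, sorting by last heartbeat already pushes stale nodes to the end
of the list, so the two options differ mainly in whether a stale node can ever
be picked: b) still picks one when nothing better exists, while a) on its own
would exclude it.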