[ 
https://issues.apache.org/jira/browse/HDFS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223105#comment-14223105
 ] 

Kihwal Lee commented on HDFS-7342:
----------------------------------

bq. How about scheduling replication during the lease recovery for such 
penultimate blocks with at least one replica available to satisfy 
min-replication, then go ahead for lease recovery. Till now this situation 
might not have been experienced as minReplication itself by default was 1. 

Let's first think about the meaning of min-replication. It is the level of 
degradation that is allowed before being considered critical in terms of data 
durability. Falling below this level does not necessarily mean a failure (i.e. 
data not available) unless min-replica is 1. For synchronous or 
semi-synchronous operations such as {{addBlock()}} and {{complete()}}, this is 
*enforced* to maintain the healthy steady state. Clients also do their best to 
meet this, but any failures on datanodes between finalizing a block and sending 
the IBR are beyond their control.  For asynchronous recovery activities such as 
lease recovery and replication, min-replica should be advisory.  Since 
replication is already doing the right thing, let's focus on lease recovery.

Dealing with COMMITTED blocks is simpler. Being committed means the client 
believed enough replicas were finalized. If the lease has expired, the 
block can simply turn into COMPLETE. If it has at least one live replica, it 
will be replicated soon after closing the file. If it doesn't, the block will 
be considered missing.  I think it is better to report the committed but 
missing data early rather than hiding it in the infinite lease recovery cycle.  
Also, recovery will be faster this way.  If all blocks in a file are in either 
the COMPLETE or the COMMITTED state, lease recovery may force-complete all 
COMMITTED blocks and close the file. The rest will be up to the replication 
monitor.
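
To make the COMMITTED case concrete, here is a rough sketch of the proposed expiry handling. It is only an illustration under simplifying assumptions: {{LeaseExpirySketch}}, {{Block}}, {{forceCompleteIfPossible()}} and {{countMissing()}} are made-up names, not the NameNode's actual classes or methods.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the proposal. These are not the
// NameNode's real classes or method names.
class LeaseExpirySketch {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static final class Block {
        BlockState state;
        final int liveReplicas;

        Block(BlockState state, int liveReplicas) {
            this.state = state;
            this.liveReplicas = liveReplicas;
        }
    }

    // If every block is COMMITTED or COMPLETE, force-complete the COMMITTED
    // ones and report that the file can be closed. Otherwise change nothing.
    static boolean forceCompleteIfPossible(List<Block> blocks) {
        for (Block b : blocks) {
            if (b.state == BlockState.UNDER_CONSTRUCTION) {
                return false; // still needs block recovery first
            }
        }
        for (Block b : blocks) {
            if (b.state == BlockState.COMMITTED) {
                b.state = BlockState.COMPLETE;
            }
        }
        return true;
    }

    // Blocks with no live replica are surfaced as missing right away;
    // everything else is left to the replication monitor.
    static int countMissing(List<Block> blocks) {
        int missing = 0;
        for (Block b : blocks) {
            if (b.liveReplicas == 0) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block(BlockState.COMPLETE, 3),
            new Block(BlockState.COMMITTED, 1),
            new Block(BlockState.COMMITTED, 0));
        System.out.println("closable: " + forceCompleteIfPossible(blocks));
        System.out.println("missing blocks: " + countMissing(blocks));
    }
}
```

The point of the sketch is that a COMMITTED block with zero live replicas no longer keeps the file open; it is closed and counted as missing instead.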

{{recoverLeaseInternal()}} and {{internalReleaseLease()}} will need to 
distinguish on-demand recovery from normal lease expiration.  For 
on-demand recovery, we might want it to fail if there are no live replicas, as 
a file lease is normally recovered for a subsequent append or copy (read). If 
there is no data, those operations will fail anyway.
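
The distinction between the two triggers could look roughly like this. Again, this is a hypothetical sketch: {{RecoveryTrigger}}, {{Outcome}} and {{decide()}} are invented for illustration and do not correspond to existing parameters of {{recoverLeaseInternal()}} or {{internalReleaseLease()}}.

```java
// Hypothetical decision helper; the names are illustrative only.
class RecoveryModeSketch {
    enum RecoveryTrigger { ON_DEMAND, LEASE_EXPIRED }
    enum Outcome { CLOSE_FILE, FAIL_REQUEST }

    // liveReplicasPerBlock holds the live replica count of each block
    // in the file being recovered.
    static Outcome decide(RecoveryTrigger trigger, int... liveReplicasPerBlock) {
        boolean someBlockHasNoReplica = false;
        for (int live : liveReplicasPerBlock) {
            if (live == 0) {
                someBlockHasNoReplica = true;
                break;
            }
        }
        // On-demand recovery is usually followed by append or read, which
        // cannot succeed without data, so fail the request up front.
        if (trigger == RecoveryTrigger.ON_DEMAND && someBlockHasNoReplica) {
            return Outcome.FAIL_REQUEST;
        }
        // Normal expiration closes the file; blocks without replicas are
        // then reported as missing instead of being retried forever.
        return Outcome.CLOSE_FILE;
    }

    public static void main(String[] args) {
        System.out.println(decide(RecoveryTrigger.ON_DEMAND, 2, 0));
        System.out.println(decide(RecoveryTrigger.LEASE_EXPIRED, 2, 0));
    }
}
```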

For recovering blocks in the UNDER_CONSTRUCTION state, we can make 
{{commitBlockSynchronization()}} force a commit when there is at least one 
replica, ignoring min-replication. It will allow the recovery to make progress 
and eventually the file to be closed if there is at least one replica per 
block. Then the blocks can be replicated.  This is far better than getting 
stuck in recovery.
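
As a minimal sketch of the relaxed check, assuming a min-replication of at least 1: the {{commit()}} helper, its {{forceWithOneReplica}} flag and the {{Result}} values below are hypothetical and are not part of the real {{commitBlockSynchronization()}} signature. The strict branch stands in for the behavior this proposal wants to relax.

```java
// Hypothetical sketch of the relaxed commit check; not actual HDFS code.
class ForceCommitSketch {
    enum Result { COMMITTED, COMMITTED_BELOW_MIN, RETRY_RECOVERY }

    // replicas: replicas that took part in the block recovery.
    // minReplication: configured minimum, assumed to be >= 1.
    // forceWithOneReplica: the relaxation proposed in this comment.
    static Result commit(int replicas, int minReplication, boolean forceWithOneReplica) {
        if (replicas >= minReplication) {
            return Result.COMMITTED;
        }
        if (forceWithOneReplica && replicas >= 1) {
            // Recovery makes progress; replication fixes the count later.
            return Result.COMMITTED_BELOW_MIN;
        }
        // Either strict mode or no replica at all: recovery cannot finish.
        return Result.RETRY_RECOVERY;
    }

    public static void main(String[] args) {
        System.out.println(commit(1, 2, false));
        System.out.println(commit(1, 2, true));
    }
}
```

With min-replication treated as advisory here, a block is only left in recovery when it has no replica at all.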

> Lease Recovery doesn't happen some times
> ----------------------------------------
>
>                 Key: HDFS-7342
>                 URL: https://issues.apache.org/jira/browse/HDFS-7342
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-7342.1.patch, HDFS-7342.2.patch, HDFS-7342.3.patch
>
>
> In some cases, LeaseManager tries to recover a lease, but is not able to. 
> HDFS-4882 describes a possibility of that. We should fix this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
