[ https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642669#comment-13642669 ]

Nicolas Liochon commented on HBASE-8389:
----------------------------------------

Varun, I +1 Stack: the timeout settings you mentioned are quite impressive!
Thanks a lot for all this work.

Here is my understanding, please correct me where I'm wrong.

I don't think that single vs. multiple blocks is an issue, even if it's better 
to have a single block (increased parallelism).

HBase has a dataloss risk: we need to wait for the end of recoverFileLease 
before reading.
 => either by polling the NN, calling recoverFileLease multiple times;
 => or by calling isFileClosed (HDFS-4525), and polling it as well, where it's 
available (a small probe sketch follows).
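
To make the second option concrete, a minimal probe for isFileClosed could look 
like the sketch below. This is only my own illustration, not part of any patch 
here; the reflection approach and the class name are assumptions, the point 
being that on a client without HDFS-4525 the method simply isn't there and we 
fall back to polling recoverFileLease:

import java.lang.reflect.Method;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class IsFileClosedProbe {
  // Returns the isFileClosed(Path) Method if this HDFS client has HDFS-4525,
  // null otherwise (pre-4525 client: poll recoverFileLease instead).
  public static Method findIsFileClosed() {
    try {
      return DistributedFileSystem.class.getMethod("isFileClosed", Path.class);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }
}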

I'm not sure that we can poll recoverFileLease every second. When I try, I get 
the same logs as Eric: "java.io.IOException: The recovery id 2494 does not 
match current recovery id 2495 for block", and the state of the namenode seems 
strange.

In critical scenarios, recoverFileLease won't happen at all. The probability is 
greatly decreased by HDFS-4721, but it's not zero.

In critical scenarios, recoverFileLease will start but will get stuck on bad 
datanodes. The probability is greatly decreased by HDFS-4721 and HDFS-4754, 
but it's not zero. Here, we need to limit the number of retries in HDFS to one, 
whatever the global setting, to be on the safe side (no HDFS JIRA for this yet).

I see a possible common implementation (trunk / hbase 0.94); a rough sketch 
follows the list:
 - if HDFS-4754 is available, call markAsStale to be sure this datanode won't 
be used;
 - call recoverFileLease a first time;
 - if HDFS-4525 is available, call isFileClosed every second to detect that the 
recovery is done;
 - every 60s, call recoverFileLease again (either because isFileClosed is 
missing, or because we hit one of the bad scenarios above).
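
To make these steps concrete, here is a rough sketch of the waiting loop. Again, 
this is my own illustration and not the actual patch: the DistributedFileSystem 
handle, the hasIsFileClosed flag and the 1s / 60s constants are assumptions, and 
the HDFS-4754 markAsStale step is left out.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoverySketch {
  private static final long POLL_MS = 1000;          // isFileClosed probe interval
  private static final long REISSUE_MS = 60 * 1000;  // re-issue recoverLease this often

  // Blocks until the lease on 'path' has been recovered and the file is closed.
  public static void waitForLeaseRecovery(DistributedFileSystem dfs, Path path,
      boolean hasIsFileClosed) throws IOException, InterruptedException {
    long lastRecoverCall = System.currentTimeMillis();
    if (dfs.recoverLease(path)) {
      return;  // file already closed, nothing to wait for
    }
    while (true) {
      Thread.sleep(POLL_MS);
      // Cheap check when HDFS-4525 is available: is the recovery finished?
      if (hasIsFileClosed && dfs.isFileClosed(path)) {
        return;
      }
      // Every ~60s, re-issue recoverLease: either isFileClosed is missing,
      // or the first recovery got stuck on a bad datanode.
      long now = System.currentTimeMillis();
      if (now - lastRecoverCall >= REISSUE_MS) {
        lastRecoverCall = now;
        if (dfs.recoverLease(path)) {
          return;
        }
      }
    }
  }
}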

This would mean no dataloss, and an MTTR of:
 - less than a minute if we have stale mode + HDFS-4721 + HDFS-4754 + HDFS-4525 
+ no retry in HDFS recoverLease, or Varun's settings;
 - around 12 minutes if we have none of the above, but that's what we already 
have today without the stale mode, imho;
 - somewhere in between if we have a subset of the above patches and config.

As HDFS-4721 seems validated by the HDFS dev team, I think my only question is: 
can we poll recoverFileLease very frequently if we don't have isFileClosed?

As a side note, tests more or less similar to yours, with HBase trunk and HDFS 
branch-2 trunk (without your settings but with a hack to skip the dead nodes), 
bring similar results.

                
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
>                 Key: HBASE-8389
>                 URL: https://issues.apache.org/jira/browse/HBASE-8389
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Varun Sharma
>            Assignee: Varun Sharma
>            Priority: Critical
>             Fix For: 0.94.8
>
>         Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
> 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
> 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
> sample.patch
>
>
> We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
> recoveries because of the short retry interval of 1 second between lease 
> recoveries.
> The namenode gets into the following loop:
> 1) Receives lease recovery request and initiates recovery choosing a primary 
> datanode every second
> 2) A lease recovery is successful and the namenode tries to commit the block 
> under recovery as finalized - this takes < 10 seconds in our environment 
> since we run with tight HDFS socket timeouts.
> 3) At step 2), there is a more recent recovery enqueued because of the 
> aggressive retries. This causes the committed block to get preempted and we 
> enter a vicious cycle
> So we do,  <initiate_recovery> --> <commit_block> --> 
> <commit_preempted_by_another_recovery>
> This loop is paused after 300 seconds which is the 
> "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes 
> which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
> detection timeout is 20 seconds.
> Note that before the patch, we do not call recoverLease so aggressively - 
> also it seems that the HDFS namenode is pretty dumb in that it keeps 
> initiating new recoveries for every call. Before the patch, we call 
> recoverLease, assume that the block was recovered, try to get the file, it 
> has zero length since it's under recovery, we fail the task and retry until 
> we get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
> have a more generous timeout. The timeout needs to be configurable since 
> typically, the recovery takes as much time as the DFS timeouts. The primary 
> datanode doing the recovery tries to reconcile the blocks and hits the 
> timeouts when it tries to contact the dead node. So the recovery is as fast 
> as the HDFS timeouts.
> 2) We have another issue I reported in HDFS-4721. The Namenode chooses the 
> stale datanode to perform the recovery (since it's still alive). Hence the 
> first recovery request is bound to fail. So if we want a tight MTTR, we 
> either need something like HDFS-4721 or we need something like this:
>   recoverLease(...)
>   sleep(1000)
>   recoverLease(...)
>   sleep(configuredTimeout)
>   recoverLease(...)
>   sleep(configuredTimeout)
> Where configuredTimeout should be large enough to let the recovery happen but 
> the first timeout is short so that we get past the moot recovery in step #1.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
