[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641467#comment-13641467
 ] 

Varun Sharma commented on HBASE-8389:
-------------------------------------

Alright, it turns out I made a mistake running the recent tests. Lease recovery 
is correct in Hadoop. I had forgotten what the v5 patch actually does - it 
reverts to the old behaviour - which is why I kept searching the namenode logs 
for multiple lease recoveries :)

HDFS timeouts (on both the region server and HDFS sides): socket timeout = 3 
seconds, socket write timeout = 5 seconds, and ipc connect retries = 0 (the ipc 
connect timeout itself is hardcoded at 20 seconds, which is way too high).
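For reference, a minimal sketch of how those values could be set on the client 
Configuration. The key names below match Hadoop 1.x era clients and are an 
assumption on my part, so check your release before copying them:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class TightHdfsTimeouts {
    // Assumed key names (Hadoop 1.x era); newer releases rename some of these.
    public static Configuration create() {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("dfs.socket.timeout", 3000);                 // read socket timeout: 3 s
      conf.setInt("dfs.datanode.socket.write.timeout", 5000);  // write socket timeout: 5 s
      conf.setInt("ipc.client.connect.max.retries", 0);        // no ipc connect retries
      // The per-attempt ipc connect timeout itself is hardcoded (~20 s) in this
      // Hadoop line, which is why it is called out above as being too high.
      return conf;
    }
  }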

I am summarizing each case:
1) After this patch
When we split a log, we do the following (a rough sketch follows this list):
  a) Call recoverLease, which enqueues a block recovery whose primary is the 
dead datanode, so effectively a no-op
  b) Sleep 4 seconds
  c) Break out of the loop and access the file regardless of whether recovery 
happened
  d) Sometimes fail, but eventually get through
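A minimal sketch of that sequence, assuming a DistributedFileSystem client; the 
class, method, and constant names here are mine, not the patch's:

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class BestEffortWalOpen {
    // Illustrative only - the real logic lives in the log-splitting code.
    public static FSDataInputStream open(FileSystem fs, Path hlog)
        throws IOException, InterruptedException {
      if (fs instanceof DistributedFileSystem) {
        // (a) enqueue a block recovery on the namenode; with the dead datanode
        // chosen as the recovery primary this is effectively a no-op
        ((DistributedFileSystem) fs).recoverLease(hlog);
      }
      Thread.sleep(4000);   // (b) wait 4 seconds
      // (c) open the file regardless of whether recovery actually completed;
      // (d) the read may fail, in which case the split task is retried elsewhere
      return fs.open(hlog);
    }
  }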

Note that lease recovery has not happened. If HBase finds a zero-size hlog at 
any of the datanodes (the size is typically zero at the namenode since the file 
is not closed yet), it errors out and unassigns the task, and some other region 
server picks up the split task. From the HBase console I am always seeing a 
non-zero number of edits being split, so we are reading data. I am not sure if 
Accumulo does similar checks for zero-sized WALs, but [~ecn] will know better.
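For illustration, a hypothetical version of that zero-length guard; HBase's 
actual check in the log-splitting path is more involved, so treat this purely 
as a sketch:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ZeroLengthGuard {
    // Hypothetical guard: fail the split task if the WAL looks empty.
    public static void check(FileSystem fs, Path wal) throws IOException {
      FileStatus status = fs.getFileStatus(wal);
      if (status.getLen() == 0) {
        // The namenode often reports 0 for an unclosed file; throwing here
        // unassigns the split task so another region server can retry it.
        throw new IOException("Possibly empty WAL " + wal + ", failing split task");
      }
    }
  }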

Since lease recovery has not happened, we risk data loss, but it again depends 
on what kind of data loss Accumulo sees - whether entire WAL(s) are lost or 
only portions of them. If entire WAL(s) are lost, maybe the zero-size check in 
HBase saves it from data loss. But if portions of WALs are lost in Accumulo 
when the recoverLease return value is not checked, then we can have data loss 
after the v5 patch. Again, I will let [~ecn] speak to that.

The good news, though, is that I am seeing pretty good MTTR in this case. It is 
typically 2-3 minutes, and WAL splitting accounts for maybe 30-40 seconds. But 
note that I am running with HDFS-3912 and HDFS-3703, and that my HDFS timeouts 
are configured to fail fast.

2) Before this patch but after HBASE-8354

We have the issue where lease recoveries pile up on the namenode faster than 
they can be served (one is issued every second), with the side effect that each 
later recovery preempts the earlier one. With HDFS it is simply not possible to 
complete a lease recovery within 4 seconds unless we use some of the stale-node 
patches and really tighten all the HDFS timeouts and retries. So recoveries 
never finish within one second; they keep piling up and preempting earlier 
recoveries. Eventually we wait out the 300 seconds of 
hbase.lease.recovery.timeout, after which we just open the file; by then the 
last recovery has usually succeeded.

MTTR is not good in this case - at least 6 minutes for log splitting. One 
possibility would have been to reduce the 300-second timeout to maybe 20 
seconds.
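A simplified sketch of that pre-v5 retry pattern (class and method names are 
mine): calling recoverLease every second re-triggers recovery on the namenode, 
and each new attempt preempts the commit of the previous one.

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class AggressiveRecovery {
    // recoveryTimeoutMs corresponds to hbase.lease.recovery.timeout (300 000 ms).
    public static void recover(DistributedFileSystem dfs, Path wal,
        long recoveryTimeoutMs) throws IOException, InterruptedException {
      long deadline = System.currentTimeMillis() + recoveryTimeoutMs;
      boolean recovered = false;
      while (!recovered && System.currentTimeMillis() < deadline) {
        // Each call enqueues yet another recovery; since none finishes within a
        // second, the next call preempts the commit of the previous attempt.
        recovered = dfs.recoverLease(wal);
        if (!recovered) {
          Thread.sleep(1000);
        }
      }
      // After ~300 s we give up waiting and open the file anyway; by then the
      // last recovery has usually gone through.
    }
  }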

3) One can have the best of both worlds - good MTTR and little or no data loss 
- by opening files only after real lease recovery has happened, which avoids 
data corruption. For that, one would need to tune the HDFS timeouts (connect + 
socket) to be low so that lease recoveries can complete within 5-10 seconds. 
For such cases I think we should have a parameter saying whether we want to 
force lease recovery before opening the file - I am going to raise a JIRA to 
discuss that configuration. Overall, if we had an isClosed() API life would be 
so much easier, but a large number of Hadoop releases do not have it yet. I 
think this is more of a power-user configuration, but it probably makes sense 
to have one.
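A hedged sketch of that approach, relying only on recoverLease()'s boolean 
return value since an isClosed()/isFileClosed() API is not available on many 
Hadoop releases; the names and parameters here are mine:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class PatientRecovery {
    // Returns true once the namenode reports the file closed, i.e. block
    // reconciliation on the recovery primary actually finished.
    public static boolean waitForLeaseRecovery(DistributedFileSystem dfs, Path wal,
        long pauseMs, int maxAttempts) throws IOException, InterruptedException {
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        if (dfs.recoverLease(wal)) {
          return true;
        }
        Thread.sleep(pauseMs);  // with tight HDFS timeouts this stays in the 5-10 s range
      }
      return false;             // caller decides whether to open the file anyway
    }
  }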

Thanks!
                
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
>                 Key: HBASE-8389
>                 URL: https://issues.apache.org/jira/browse/HBASE-8389
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Varun Sharma
>            Assignee: Varun Sharma
>            Priority: Critical
>             Fix For: 0.94.8
>
>         Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
> 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
> 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
> sample.patch
>
>
> We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
> recoveries because of the short retry interval of 1 second between lease 
> recoveries.
> The namenode gets into the following loop:
> 1) Receives a lease recovery request every second and initiates recovery, 
> choosing a primary datanode each time
> 2) A lease recovery is successful and the namenode tries to commit the block 
> under recovery as finalized - this takes < 10 seconds in our environment 
> since we run with tight HDFS socket timeouts.
> 3) At step 2), there is a more recent recovery enqueued because of the 
> aggressive retries. This causes the committed block to get preempted and we 
> enter a vicious cycle
> So we do,  <initiate_recovery> --> <commit_block> --> 
> <commit_preempted_by_another_recovery>
> This loop is only paused after 300 seconds, which is the 
> "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes, 
> which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
> detection timeout is 20 seconds.
> Note that before the patch, we do not call recoverLease so aggressively - 
> also, it seems that the HDFS namenode is pretty dumb in that it keeps 
> initiating new recoveries for every call. Before the patch, we call 
> recoverLease, assume that the block was recovered, try to get the file, find 
> that it has zero length since it's under recovery, fail the task, and retry 
> until we get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
> have a more generous timeout. The timeout needs to be configurable since 
> typically, the recovery takes as much time as the DFS timeouts. The primary 
> datanode doing the recovery tries to reconcile the blocks and hits the 
> timeouts when it tries to contact the dead node. So the recovery is as fast 
> as the HDFS timeouts.
> 2) We have another issue, which I reported in HDFS-4721. The namenode chooses 
> the stale datanode to perform the recovery (since it's still alive). Hence 
> the first recovery request is bound to fail. So if we want a tight MTTR, we 
> either need something like HDFS-4721 or we need something like this:
>   recoverLease(...)
>   sleep(1000)
>   recoverLease(...)
>   sleep(configuredTimeout)
>   recoverLease(...)
>   sleep(configuredTimeout)
> Here configuredTimeout should be large enough to let the recovery happen, 
> while the first timeout is short so that we get past the moot recovery 
> triggered by the first recoverLease call.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
