[ https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642629#comment-13642629 ]

stack commented on HBASE-8389:
------------------------------

Reading over this nice, fat, info-dense issue, I am trying to figure out what we 
need to add to trunk right now.

Sounds like checking the recoverFileLease return value gained us little in the 
end (though Varun, you think we want to keep going until it returns true, even 
though v5 here skips out on it).  The valuable finding hereabouts is the need 
for a pause before going ahead with the file open, it seems.  Trunk does not 
have this pause.  Do I need to add a version of v5 to trunk?  (Holding our 
breath until an api that is not yet generally available, isFileClosed from 
hbase-8394, shows up is not an option for now; nor is expecting that everyone 
will just upgrade to an hdfs that has this api, either.)
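(A minimal sketch, not the eventual patch, of how we could probe for the 
isFileClosed api via reflection and fall back quietly when the running hdfs 
does not carry it; the class and method names below are illustrative only:)

  import java.lang.reflect.Method;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class IsFileClosedProbe {
    // Returns true only if hdfs reports the file closed; returns false when
    // the api is absent on this hdfs or the reflective call fails.
    static boolean isFileClosed(DistributedFileSystem dfs, Path p) {
      try {
        Method m = DistributedFileSystem.class.getMethod("isFileClosed", Path.class);
        return (Boolean) m.invoke(dfs, p);
      } catch (NoSuchMethodException e) {
        return false;  // older hdfs without the isFileClosed api
      } catch (Exception e) {
        return false;  // treat reflection/IO failures as "not known closed"
      }
    }
  }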

The hbase-7878 backport is now elided since, with the patch applied here, we 
have added back the old behavior, excepting the pause of an arbitrary-enough 
4 seconds.

The applied patch here does not loop on recoverLease after the 4 seconds 
expire; it breaks.  In trunk we loop.  We should break too (...and let it fail 
if 0 length, and then let the next split task do a new recoverLease call?)

On the 4 seconds: if I follow Varun's reasoning above properly, the pause 
should rather be the dfs.socket.timeout that hdfs is using -- plus a second or 
so -- rather than a flat "4 seconds", and we could just remove the new config 
'hbase.lease.recovery.retry.interval' (we have enough configs already)?
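(A minimal sketch of the recoverLease-then-pause-then-break flow described 
above, with the pause derived from dfs.socket.timeout; the 60s default and the 
class name are assumptions, not the applied patch:)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class LeaseRecoverySketch {
    static void recoverLeaseThenBreak(DistributedFileSystem dfs, Path p,
        Configuration conf) throws Exception {
      if (dfs.recoverLease(p)) {
        return;  // lease already recovered; safe to go ahead with the open
      }
      // Pause one dfs.socket.timeout plus a second rather than a flat 4s
      // or a new hbase-specific retry-interval config.
      long pauseMs = conf.getLong("dfs.socket.timeout", 60000) + 1000;
      Thread.sleep(pauseMs);
      dfs.recoverLease(p);
      // Break here rather than loop: go ahead with the open; if the file comes
      // back zero length, the split task fails and the next attempt issues a
      // fresh recoverLease call.
    }
  }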

Sounds like we are depending on WAL sizes being < HDFS block sizes.  This will 
not always be the case; we could easily go into a second block if a big edit 
comes in on the tail of the first block, and then there may be data loss (TBD) 
because we have a file size (so we think the file recovered?)

Sounds also like we are relying on the file size being zero as a marker that 
the file is not yet closed (I suppose that is ok because an empty WAL will be 
> 0 length IIRC; we should doc. our dependency though).
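(A minimal sketch of that zero-length heuristic, assuming, per the above, that 
an empty-but-closed WAL still carries a header and so reports a length > 0:)

  import java.io.IOException;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WalLengthCheck {
    // A reported length of 0 is taken to mean the file is still under
    // recovery / not yet closed; anything > 0 is treated as recovered.
    static boolean looksRecovered(FileSystem fs, Path wal) throws IOException {
      FileStatus st = fs.getFileStatus(wal);
      return st.getLen() > 0;
    }
  }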

Varun, I like your low timeouts.  Would you suggest we adjust hbase default 
timeouts down and recommend folks change their hdfs defaults if they want 
better MTTR?  If you had a blog post on the nice work you have done in here, I 
could at least point the refguide at it for those interested in improved MTTR 
(smile).
                
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
>                 Key: HBASE-8389
>                 URL: https://issues.apache.org/jira/browse/HBASE-8389
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Varun Sharma
>            Assignee: Varun Sharma
>            Priority: Critical
>             Fix For: 0.94.8
>
>         Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
> 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
> 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
> sample.patch
>
>
> We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
> recoveries because of the short retry interval of 1 second between lease 
> recoveries.
> The namenode gets into the following loop:
> 1) Receives lease recovery request and initiates recovery choosing a primary 
> datanode every second
> 2) A lease recovery is successful and the namenode tries to commit the block 
> under recovery as finalized - this takes < 10 seconds in our environment 
> since we run with tight HDFS socket timeouts.
> 3) At step 2), there is a more recent recovery enqueued because of the 
> aggressive retries. This causes the committed block to get preempted and we 
> enter a vicious cycle
> So we do,  <initiate_recovery> --> <commit_block> --> 
> <commit_preempted_by_another_recovery>
> This loop is paused after 300 seconds which is the 
> "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes 
> which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
> detection timeout is 20 seconds.
> Note that before the patch, we do not call recoverLease so aggressively - 
> also it seems that the HDFS namenode is pretty dumb in that it keeps 
> initiating new recoveries for every call. Before the patch, we call 
> recoverLease, assume that the block was recovered, try to get the file, it 
> has zero length since it's under recovery, we fail the task and retry until we 
> get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
> have a more generous timeout. The timeout needs to be configurable since 
> typically, the recovery takes as much time as the DFS timeouts. The primary 
> datanode doing the recovery tries to reconcile the blocks and hits the 
> timeouts when it tries to contact the dead node. So the recovery is as fast 
> as the HDFS timeouts.
> 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
> stale datanode to perform the recovery (since it's still alive). Hence the 
> first recovery request is bound to fail. So if we want a tight MTTR, we 
> either need something like HDFS 4721 or we need something like this
>   recoverLease(...)
>   sleep(1000)
>   recoverLease(...)
>   sleep(configuredTimeout)
>   recoverLease(...)
>   sleep(configuredTimeout)
> Where configuredTimeout should be large enough to let the recovery happen but 
> the first timeout is short so that we get past the moot recovery in step #1.
>  
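(The retry schedule described in fix #2 above, written out as a minimal sketch 
that checks the boolean recoverLease returns; configuredTimeout is whatever the 
operator sets, and the class name is illustrative:)

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class RecoverLeaseSchedule {
    // The first pause is short so we get past the doomed first recovery (the
    // stale datanode is picked as primary); later pauses are long enough for
    // a real recovery, i.e. on the order of the HDFS timeouts.
    static void recoverWithSchedule(DistributedFileSystem dfs, Path p,
        long configuredTimeoutMs) throws Exception {
      if (dfs.recoverLease(p)) return;
      Thread.sleep(1000);                 // short: skip past the moot first recovery
      if (dfs.recoverLease(p)) return;
      Thread.sleep(configuredTimeoutMs);  // long enough for a real recovery
      if (dfs.recoverLease(p)) return;
      Thread.sleep(configuredTimeoutMs);
    }
  }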

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
