[ https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643136#comment-13643136 ]
Varun Sharma commented on HBASE-8389:
-------------------------------------

[~saint....@gmail.com] I can do a small write-up that folks can refer to.

[~nkeywal] One point regarding the low setting, though. It is good for fast-MTTR requirements such as online clusters, but it does not work well if you pound a small cluster with MapReduce jobs - the write timeouts start kicking in on the datanodes. We saw this on a small cluster, so it has to be taken with a pinch of salt.

I think 4 seconds might be too tight, because we have the following sequence:
1) recoverLease is called.
2) The primary datanode heartbeats (this can take 3 seconds in the worst case).
3) Multiple timeouts apply during recovery at the primary datanode:
a) dfs.socket.timeout kicks in when we suspend the processes using "kill -STOP" - there is only 1 retry.
b) ipc.client.connect.timeout is the troublemaker - on old hadoop versions it is hardcoded at 20 seconds, and on some versions the number of retries is hardcoded at 45. This can be triggered by firewalling a host with iptables to drop all incoming/outgoing TCP packets. Another issue here is that between the timeouts there is a 1 second hardcoded sleep :) - I just fixed that in HADOOP-9503.

The retry rate should be no faster than 3a and 3b, or lease recoveries will accumulate for 900 seconds in trunk. To get around this problem, we would want to make sure that hbase-site.xml has the same dfs.socket.timeout and ipc client settings as hdfs-site.xml, and then calculate the recovery interval from those settings (a sketch of that calculation follows below). Otherwise, we can leave a release note saying that this number should be max(dfs.socket.timeout, ipc.client.connect.max.retries.on.timeouts * ipc.client.connect.timeout, ipc.client.connect.max.retries).

The advantage of having HDFS-4721 is that at some point the datanode will be recognized as stale - maybe a little later than HDFS recovery. Once that happens, recoveries typically occur within 2 seconds.
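To make the derivation concrete, here is a minimal sketch of computing the recovery interval from mirrored client settings. The class and method names are hypothetical, the default values are ordinary Hadoop client defaults of this era, and the last term of the max(...) formula above (a bare retry count) is interpreted here as retries multiplied by the per-attempt connect timeout - an assumption, not something the formula states:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RecoveryIntervalEstimator {
  /**
   * Hypothetical helper: derive a lease recovery retry interval (ms) from the
   * HDFS client timeouts, assuming hbase-site.xml mirrors hdfs-site.xml.
   */
  public static long recoveryIntervalMs(Configuration conf) {
    // dfs.socket.timeout - datanode socket timeout; only 1 retry applies (3a).
    long socketTimeoutMs = conf.getLong("dfs.socket.timeout", 60000L);
    // ipc.client.connect.timeout - per-attempt connect timeout
    // (hardcoded at 20 seconds on old hadoop versions).
    long connectTimeoutMs = conf.getLong("ipc.client.connect.timeout", 20000L);
    // ipc.client.connect.max.retries.on.timeouts - hardcoded at 45 on some versions (3b).
    long retriesOnTimeouts = conf.getLong("ipc.client.connect.max.retries.on.timeouts", 45L);
    // ipc.client.connect.max.retries - retries on plain connect failures.
    long connectRetries = conf.getLong("ipc.client.connect.max.retries", 10L);
    // ASSUMPTION: the bare retry count in the release-note formula is read as
    // retries * per-attempt connect timeout so all terms are in milliseconds.
    return Math.max(socketTimeoutMs,
        Math.max(retriesOnTimeouts * connectTimeoutMs,
            connectRetries * connectTimeoutMs));
  }
}
{code}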
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
>                 Key: HBASE-8389
>                 URL: https://issues.apache.org/jira/browse/HBASE-8389
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Varun Sharma
>            Assignee: Varun Sharma
>            Priority: Critical
>             Fix For: 0.94.8
>
>         Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, sample.patch
>
> We ran hbase 0.94.3 patched with HBASE-8354 and observed too many outstanding lease recoveries because of the short retry interval of 1 second between lease recoveries.
> The namenode gets into the following loop:
> 1) It receives a lease recovery request and initiates recovery, choosing a primary datanode, every second.
> 2) A lease recovery succeeds and the namenode tries to commit the block under recovery as finalized - this takes < 10 seconds in our environment since we run with tight HDFS socket timeouts.
> 3) At step 2), a more recent recovery is already enqueued because of the aggressive retries. This causes the committed block to get preempted, and we enter a vicious cycle:
> <initiate_recovery> --> <commit_block> --> <commit_preempted_by_another_recovery>
> This loop is paused after 300 seconds, which is the "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes, which is terrible. Our ZK session timeout is 30 seconds and the HDFS stale node detection timeout is 20 seconds.
> Note that before the patch, we did not call recoverLease so aggressively - also, it seems that the HDFS namenode is pretty dumb in that it keeps initiating new recoveries for every call. Before the patch, we would call recoverLease, assume that the block was recovered, and try to get the file; it has zero length since it is under recovery, so we fail the task and retry until we get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need a more generous, configurable timeout, since the recovery typically takes as much time as the DFS timeouts: the primary datanode doing the recovery tries to reconcile the blocks and hits the timeouts when it tries to contact the dead node. So the recovery is only as fast as the HDFS timeouts.
> 2) We have another issue, reported in HDFS-4721. The namenode chooses the stale datanode to perform the recovery (since it is still alive), so the first recovery request is bound to fail. If we want a tight MTTR, we either need something like HDFS-4721 or a retry schedule like the following (a runnable sketch follows after this quote):
> recoverLease(...)
> sleep(1000)
> recoverLease(...)
> sleep(configuredTimeout)
> recoverLease(...)
> sleep(configuredTimeout)
> where configuredTimeout should be large enough to let the recovery happen, but the first timeout is short so that we get past the moot recovery in step #1.
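Below is a minimal sketch of that retry schedule against the real DistributedFileSystem#recoverLease call, which returns true once the file has been closed; the wrapper class and method names are hypothetical:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryLoop {
  /**
   * Hypothetical wrapper implementing the schedule above: a short first pause
   * (the first attempt is expected to fail while the stale datanode is still
   * chosen as primary), then pauses long enough for a full recovery so that
   * requests do not pile up at the namenode.
   */
  public static void recoverWithBackoff(DistributedFileSystem dfs, Path path,
      long configuredTimeoutMs) throws Exception {
    long pauseMs = 1000L; // short first pause, per the schedule above
    // recoverLease returns true once the file is closed and the lease released.
    while (!dfs.recoverLease(path)) {
      Thread.sleep(pauseMs);
      pauseMs = configuredTimeoutMs; // subsequent pauses allow a recovery to finish
    }
  }
}
{code}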