[
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643136#comment-13643136
]
Varun Sharma commented on HBASE-8389:
-------------------------------------
[[email protected]]
I can do a small write up that folks can refer to.
[~nkeywal]
One point regarding the low setting, though: it's good for fast MTTR requirements
such as online clusters, but it does not work well if you pound a small cluster
with MapReduce jobs. The write timeouts start kicking in on the datanodes - we saw
this on a small cluster. So it has to be taken with a pinch of salt.
I think 4 seconds might be too tight, because we have the following sequence -
1) recoverLease is called
2) The primary datanode heartbeats (this can take 3 seconds in the worst case)
3) There are multiple timeouts during recovery at the primary datanode:
a) dfs.socket.timeout kicks in when we suspend the processes using "kill
-STOP" - there is only 1 retry
b) ipc.client.connect.timeout is the troublemaker - on old Hadoop versions
it is hardcoded at 20 seconds, and on some versions the number of retries is
hardcoded at 45. This can be triggered by firewalling a host using iptables to
drop all incoming/outgoing TCP packets. Another issue here is that between the
timeouts there is a 1 second hardcoded sleep :) - I just fixed that in
HADOOP-9503. If we make sure that dfs.socket.timeout and the ipc client settings
are the same in hbase-site.xml and hdfs-site.xml, then we can derive the retry
interval from those settings.
The retry rate should be no faster than the timeouts in 3a and 3b - or lease
recoveries will accumulate for 900 seconds in trunk. To get around this problem,
we would want to make sure that hbase-site.xml has the same settings as
hdfs-site.xml and calculate the recovery interval from those settings. Otherwise,
we can leave a release note saying that this number should be
max(dfs.socket.timeout,
ipc.client.connect.max.retries.on.timeouts * ipc.client.connect.timeout,
ipc.client.connect.max.retries).
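To make that release-note math concrete, here is a rough sketch (not from any
patch - the config keys and defaults are the stock Hadoop ones as I understand
them, and I leave out the bare ipc.client.connect.max.retries term since it is a
count rather than a time) of how such a floor for the recovery retry interval
could be computed:

    import org.apache.hadoop.conf.Configuration;

    public class LeaseRecoveryIntervalEstimate {
      // Rough lower bound on the lease recovery retry interval, per the
      // max(...) note above. The defaults are illustrative stock values only.
      static long minRecoveryIntervalMs(Configuration conf) {
        long socketTimeoutMs = conf.getLong("dfs.socket.timeout", 60000);          // 3a
        long connectTimeoutMs = conf.getLong("ipc.client.connect.timeout", 20000); // 3b
        int retriesOnTimeouts = conf.getInt("ipc.client.connect.max.retries.on.timeouts", 45);
        return Math.max(socketTimeoutMs, retriesOnTimeouts * connectTimeoutMs);
      }

      public static void main(String[] args) {
        // Assumes hbase-site.xml / hdfs-site.xml are on the classpath, so the
        // client sees the same values the datanodes use.
        System.out.println(minRecoveryIntervalMs(new Configuration()) + " ms");
      }
    }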
The advantage of having HDFS-4721 is that at some point the datanode will be
recognized as stale - maybe a little later than HDFS recovery. Once that
happens, recoveries typically occur within 2 seconds.
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
> Key: HBASE-8389
> URL: https://issues.apache.org/jira/browse/HBASE-8389
> Project: HBase
> Issue Type: Bug
> Reporter: Varun Sharma
> Assignee: Varun Sharma
> Priority: Critical
> Fix For: 0.94.8
>
> Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt,
> 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt,
> 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log,
> sample.patch
>
>
> We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease
> recoveries because of the short retry interval of 1 second between lease
> recoveries.
> The namenode gets into the following loop:
> 1) Receives a lease recovery request every second and initiates a new
> recovery, choosing a primary datanode each time
> 2) A lease recovery is successful and the namenode tries to commit the block
> under recovery as finalized - this takes < 10 seconds in our environment
> since we run with tight HDFS socket timeouts.
> 3) At step 2), there is a more recent recovery enqueued because of the
> aggressive retries. This causes the committed block to get preempted and we
> enter a vicious cycle
> So we do: <initiate_recovery> --> <commit_block> -->
> <commit_preempted_by_another_recovery>
> This loop is paused after 300 seconds, which is the
> "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes,
> which is terrible. Our ZK session timeout is 30 seconds and the HDFS stale node
> detection timeout is 20 seconds.
> Note that before the patch, we do not call recoverLease so aggressively -
> also it seems that the HDFS namenode is pretty dumb in that it keeps
> initiating new recoveries for every call. Before the patch, we call
> recoverLease, assume that the block was recovered, try to get the file, it
> has zero length since it's under recovery, we fail the task and retry until we
> get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need to
> have a more generous timeout. The timeout needs to be configurable since,
> typically, the recovery takes as much time as the DFS timeouts. The primary
> datanode doing the recovery tries to reconcile the blocks and hits the
> timeouts when it tries to contact the dead node. So the recovery is only as
> fast as the HDFS timeouts.
> 2) We have another issue I reported in HDFS-4721. The Namenode chooses the
> stale datanode to perform the recovery (since it is still alive). Hence the
> first recovery request is bound to fail. So if we want a tight MTTR, we
> either need something like HDFS-4721 or we need something like this:
> recoverLease(...)
> sleep(1000)
> recoverLease(...)
> sleep(configuredTimeout)
> recoverLease(...)
> sleep(configuredTimeout)
> Where configuredTimeout should be large enough to let the recovery happen but
> the first timeout is short so that we get past the moot recovery in step #1.
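> Roughly, as a sketch using the plain DistributedFileSystem API (not the
> actual patch - the method and parameter names such as configuredTimeoutMs are
> just placeholders):
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class StaggeredLeaseRecovery {
>   // First retry comes quickly, to get past the attempt that is doomed to
>   // land on the stale datanode; later retries wait long enough for a real
>   // recovery to finish (the max(...) of the DFS/IPC timeouts).
>   static void recoverFileLease(Path path, Configuration conf, long firstPauseMs,
>       long configuredTimeoutMs, int maxAttempts)
>       throws IOException, InterruptedException {
>     // Assumes fs.defaultFS points at HDFS.
>     DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
>     for (int attempt = 0; attempt < maxAttempts; attempt++) {
>       if (dfs.recoverLease(path)) {
>         return; // recoverLease returns true once the file is closed
>       }
>       Thread.sleep(attempt == 0 ? firstPauseMs : configuredTimeoutMs);
>     }
>   }
> }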
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira