[
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640785#comment-13640785
]
Varun Sharma commented on HBASE-8389:
-------------------------------------
Okay, I did some testing with v5 and the MTTR was pretty good - 2-3 minutes -
log splitting took 20-30 seconds, and roughly 1 minute of this time was spent
replaying the edits from recovered_edits. This was, however, with the stale node
patches for HDFS (HDFS-3703 and HDFS-3912), and with a tight
dfs.socket.timeout=30. In most runs I suspended the datanode and the region
server processes at the same time. I also ran a test where I used iptables to
firewall all traffic to the host except ssh.
However, the weird thing was that when I tried to reproduce the failure
scenario above, that is, with the timeout set to 1 second, I could not. I looked
into the NN logs and this is what happened: lease recovery was called the first
time and a block recovery was initiated with the dead datanode (HDFS-4721 is not
applied). Lease recovery was called the second time and it returned true almost
every time I ran these tests.
This is something that I did not see the last time around. The logs I attached
above show that a lease recovery is called once by one SplitLogWorker, followed
by 25 calls by another worker, followed by another 25, and eventually hundreds
of calls the third time. The 25 calls make sense since each split worker has a
task-level timeout of 25 seconds and we call recoverLease every second. Also,
there are 3 resubmissions, so the last worker is trying to get back the lease. I
wonder if I hit a race condition which I can no longer reproduce, where one
worker had the lease and did not give it up and subsequent workers just failed
to recover the lease. In that case, HBASE-8354 is not the culprit, but I still
prefer the more relaxed timeout in this JIRA.
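To make that arithmetic concrete, here is a rough sketch (not the actual
SplitLogWorker code) of the retry pattern that yields roughly 25 recoverLease
calls per worker: one call per second until an assumed 25-second task timeout
expires. The constant names are illustrative only.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class LeaseRecoveryRetrySketch {
    private static final long TASK_TIMEOUT_MS = 25000L;   // assumed task-level timeout
    private static final long RETRY_INTERVAL_MS = 1000L;  // recoverLease every second

    static boolean recoverWithinTask(DistributedFileSystem dfs, Path walFile)
        throws Exception {
      long deadline = System.currentTimeMillis() + TASK_TIMEOUT_MS;
      while (System.currentTimeMillis() < deadline) {
        // true once the namenode reports the lease recovered / file closed
        if (dfs.recoverLease(walFile)) {
          return true;
        }
        Thread.sleep(RETRY_INTERVAL_MS);  // ~25 attempts before the task is resubmitted
      }
      return false;
    }
  }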
Also, I am now a little confused about lease recovery. It seems that lease
recovery can be separate from block recovery. Basically, recoverLease is called
the first time and we enqueue a block recovery (which is never going to happen,
since we try to hit the dead datanode that is not heartbeating). However, the
second call still returns true, which confuses me since the block is still not
finalized.
I wonder if lease recovery means anything other than flipping something at the
namenode that says who holds the lease on the file. It is quite possible that
the underlying block/file has not truly been recovered.
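A rough sketch of the kind of defensive cross-check I mean, assuming a caller
that does not fully trust the boolean returned by recoverLease and instead
verifies the file length, much like the pre-HBASE-8354 behaviour described in
the issue text below (illustrative names, not actual HBase code):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class DefensiveLeaseCheckSketch {
    static boolean leaseRecoveredAndReadable(DistributedFileSystem dfs, Path walFile)
        throws Exception {
      // The namenode says the lease has been released and the file is closed...
      if (!dfs.recoverLease(walFile)) {
        return false;
      }
      // ...but a file whose last block is still under recovery typically shows
      // up with zero length, so treat that as "not recovered yet" and retry.
      return dfs.getFileStatus(walFile).getLen() > 0;
    }
  }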
[~ecn]
Do you see something similar in your namenode logs when you kill both the
region server and the datanode: lease recovery initiated, but no real block
recovery/commitBlockSynchronization messages? When we kill the region server +
datanode, we basically kill the primary, i.e. the first datanode holding the
block - the same datanode which would be chosen for block recovery.
Thanks
Varun
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
> Key: HBASE-8389
> URL: https://issues.apache.org/jira/browse/HBASE-8389
> Project: HBase
> Issue Type: Bug
> Reporter: Varun Sharma
> Assignee: Varun Sharma
> Priority: Critical
> Fix For: 0.94.8
>
> Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt,
> 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt,
> 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log,
> sample.patch
>
>
> We ran HBase 0.94.3 patched with HBASE-8354 and observed too many outstanding
> lease recoveries because of the short retry interval of 1 second between lease
> recoveries.
> The namenode gets into the following loop:
> 1) Receives a lease recovery request and initiates recovery, choosing a
> primary datanode, every second
> 2) A lease recovery succeeds and the namenode tries to commit the block
> under recovery as finalized - this takes < 10 seconds in our environment
> since we run with tight HDFS socket timeouts.
> 3) At step 2), there is a more recent recovery enqueued because of the
> aggressive retries. This causes the committed block to get preempted, and we
> enter a vicious cycle.
> So we do <initiate_recovery> --> <commit_block> -->
> <commit_preempted_by_another_recovery>
> This loop is paused after 300 seconds, which is the
> "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes,
> which is terrible. Our ZK session timeout is 30 seconds and the HDFS stale node
> detection timeout is 20 seconds.
> Note that before the patch, we do not call recoverLease so aggressively.
> Also, it seems that the HDFS namenode is pretty dumb in that it keeps
> initiating new recoveries for every call. Before the patch, we call
> recoverLease, assume that the block was recovered, try to get the file, find
> it has zero length since it is under recovery, fail the task, and retry until
> we get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need a
> more generous timeout, and the timeout needs to be configurable since,
> typically, the recovery takes as long as the DFS timeouts: the primary
> datanode doing the recovery tries to reconcile the blocks and hits those
> timeouts when it tries to contact the dead node. So the recovery is only as
> fast as the HDFS timeouts allow.
> 2) We have another issue, which I reported in HDFS-4721. The Namenode chooses
> the stale datanode to perform the recovery (since it is still alive), so the
> first recovery request is bound to fail. If we want a tight MTTR, we either
> need something like HDFS-4721 or we need something like this:
> recoverLease(...)
> sleep(1000)
> recoverLease(...)
> sleep(configuredTimeout)
> recoverLease(...)
> sleep(configuredTimeout)
> where configuredTimeout should be large enough to let the recovery happen,
> while the first timeout is short so that we quickly get past the moot first
> recovery attempt.
>
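For reference, here is a minimal sketch of the staged retry schedule outlined
in fix 2) above, assuming (per HDFS-4721) that the first attempt is directed at
the stale/dead datanode and is expected to fail. The parameters firstPauseMs and
configuredTimeoutMs are illustrative, not real configuration keys.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class StagedLeaseRecoverySketch {
    static void recoverLeaseStaged(DistributedFileSystem dfs, Path walFile,
        long firstPauseMs, long configuredTimeoutMs) throws Exception {
      // First attempt: likely lands on the stale datanode, so expect it to
      // fail and pause only briefly before retrying.
      if (dfs.recoverLease(walFile)) {
        return;
      }
      Thread.sleep(firstPauseMs);            // short, e.g. ~1 second

      // Subsequent attempts: wait long enough for the primary datanode to hit
      // its own DFS timeouts against the dead node before asking the namenode
      // to start yet another recovery.
      while (!dfs.recoverLease(walFile)) {
        Thread.sleep(configuredTimeoutMs);   // roughly the HDFS timeouts
      }
    }
  }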
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira