[ https://issues.apache.org/jira/browse/HBASE-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665613#comment-13665613 ]
Nicolas Liochon commented on HBASE-8449:
----------------------------------------
Increase hbase.lease.recovery.timeout default to 15 minutes, i.e. more than a
standard hdfs recovery.
hbase.lease.recovery.dfs.timeout: it should not be less than 10s imho. It's not
only a question of the dfs timeout; it also seems that the NN does not like
multiple calls to recoverLease. I tested multiple calls again, and the datanode
logs were complaining about a "situation that should never occurs".
Ok, that was with multiple calls at an interval of 1 second, but it seems to be
all luck.
+ * 1. Call recoverLease.
+ * 2. If it returns true, break.
+ * 3. If it returns false, wait a few seconds and then call it again.
+ * 4. If it returns true, break.
+ * 5. If it returns false, wait for what we think the datanode socket timeout is
+ * (configurable) and then try again.
+ * 6. If it returns true, break.
+ * 7. If it returns false, repeat starting at step 5. above.
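For reference, the retry schedule in the steps quoted above could be sketched roughly as below. This is only an illustration, not the code from the patch: the class name, the BooleanSupplier stand-in for the real recoverLease call, and the maxAttempts cap are all assumptions made for the example.

```java
import java.util.function.BooleanSupplier;

/** Illustrative sketch of the quoted retry schedule; not the patch itself. */
final class LeaseRecoveryRetry {

    /**
     * @param recoverLease hypothetical stand-in for the real recoverLease call;
     *                     returns true once the lease has been recovered
     * @param firstPauseMs short pause after the first failed call (step 3)
     * @param dfsTimeoutMs pause for later retries (steps 5 and 7), meant to
     *                     approximate the datanode socket timeout
     * @param maxAttempts  assumed upper bound so the loop terminates
     * @return the number of calls made on success, or -1 if giving up
     */
    static int recover(BooleanSupplier recoverLease, long firstPauseMs,
                       long dfsTimeoutMs, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (recoverLease.getAsBoolean()) {
                return attempt;                       // steps 2, 4, 6: break on true
            }
            if (attempt == maxAttempts) {
                break;                                // no point sleeping before giving up
            }
            // Short wait after the first failure, the longer dfs wait afterwards.
            long pauseMs = (attempt == 1) ? firstPauseMs : dfsTimeoutMs;
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve the interrupt status
                return -1;
            }
        }
        return -1;
    }
}
```

The point of the two pause lengths is that only the first retry is cheap; every later one waits long enough that the previous recovery attempt has had a chance to finish or time out.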
I would propose:
the master:
- if HDFS-4754 is there, the master marks the node as stale as the first
step of the recovery.
- The master calls recoverLease as part of the distributed split. We can
enhance it in another jira to give higher priority to closed wals vs. wals
being recovered.
the region server:
- calls isFileClosed, if it's there. If true, returns.
- calls recoverLease. If true, returns.
- if isFileClosed is available, loops on it with a 1s sleep
- else loops on recoverLease with a 70s (configurable) sleep
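To make the region server part concrete, here is a rough sketch of that flow. It is an assumption-laden illustration: the class and method names are made up, both filesystem calls are passed in as BooleanSupplier stand-ins rather than real HDFS calls, a null isFileClosed means the API is not available, and maxLoops is an added bound so the example terminates.

```java
import java.util.function.BooleanSupplier;

/** Illustrative sketch of the proposed region server side wait. */
final class RegionServerLeaseWait {

    /**
     * @param isFileClosed   hypothetical stand-in for isFileClosed on the wal,
     *                       or null when the API is not available
     * @param recoverLease   hypothetical stand-in for recoverLease on the wal
     * @param pollMs         sleep between isFileClosed checks (1s in the proposal)
     * @param recoverPauseMs sleep between recoverLease calls (70s in the proposal)
     * @param maxLoops       assumed bound on the number of loop iterations
     * @return true once the file is known to be closed or the lease recovered
     */
    static boolean waitForClose(BooleanSupplier isFileClosed, BooleanSupplier recoverLease,
                                long pollMs, long recoverPauseMs, int maxLoops) {
        boolean canPoll = isFileClosed != null;
        if (canPoll && isFileClosed.getAsBoolean()) {
            return true;                       // already closed, nothing to do
        }
        if (recoverLease.getAsBoolean()) {
            return true;                       // recovered on the first call
        }
        for (int i = 0; i < maxLoops; i++) {
            try {
                // Cheap poll when isFileClosed exists, long pause otherwise.
                Thread.sleep(canPoll ? pollMs : recoverPauseMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            boolean done = canPoll ? isFileClosed.getAsBoolean()
                                   : recoverLease.getAsBoolean();
            if (done) {
                return true;
            }
        }
        return false;
    }
}
```

With isFileClosed available, recoverLease is called only once and everything after that is a read-only check, which avoids the repeated recoverLease calls the NN did not seem to like.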
> Refactor recoverLease retries and pauses informed by findings over in
> hbase-8389
> --------------------------------------------------------------------------------
>
> Key: HBASE-8449
> URL: https://issues.apache.org/jira/browse/HBASE-8449
> Project: HBase
> Issue Type: Bug
> Components: Filesystem Integration
> Affects Versions: 0.94.7, 0.95.0
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 0.95.1
>
> Attachments: 8449.txt, 8449v2.txt, 8449v3.txt, 8449v4.txt
>
>
> HBASE-8359 is an interesting issue that roams near and far. This issue is
> about making use of the findings handily summarized on the end of hbase-8359
> which have it that trunk needs refactor around how it does its recoverLease
> handling (and that the patch committed against HBASE-8359 is not what we want
> going forward).
> This issue is about making a patch that adds a lag between recoverLease
> invocations where the lag is related to dfs timeouts -- the hdfs-side dfs
> timeout -- and optionally makes use of the isFileClosed API if it is
> available (a facility that is not yet committed to a branch near you and
> unlikely to be within your locality with a good while to come).