[
https://issues.apache.org/jira/browse/HBASE-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224941#comment-14224941
]
Lars Hofhansl commented on HBASE-12534:
---------------------------------------
With that explanation, it seems we can simply get rid of MIN_RPC_TIMEOUT. If
somebody wants to set the rpc timeout low, (s)he should be free to do so. If
that timeout is set too low for the environment in question, that's their
problem to fix.
> Wrong region location cache in client after regions are moved
> -------------------------------------------------------------
>
> Key: HBASE-12534
> URL: https://issues.apache.org/jira/browse/HBASE-12534
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Liu Shaohui
> Assignee: Liu Shaohui
> Priority: Critical
> Labels: client
> Attachments: HBASE-12534-0.94-v1.diff, HBASE-12534-v1.diff
>
>
> In our 0.94 HBase cluster, we found that the client cached a wrong region
> location and did not update it after the region was moved to another regionserver.
> The cause is a wrong client config combined with a bug in RpcRetryingCaller in
> the HBase client.
> The RPC configs are as follows:
> {code}
> hbase.rpc.timeout=1000
> hbase.client.pause=200
> hbase.client.operation.timeout=1200
> {code}
> But the client retry number is 3:
> {code}
> hbase.client.retries.number=3
> {code}
> Assume that a region was on regionserver A and is then moved to
> regionserver B. The client tries to make a call to regionserver A and gets a
> NotServingRegionException. Because the retry number is not 1, the region
> location cache is not cleaned. See RpcRetryingCaller.java#141 and
> RegionServerCallable.java#127:
> {code}
> @Override
> public void throwable(Throwable t, boolean retrying) {
>   if (t instanceof SocketTimeoutException ||
>       ....
>   } else if (t instanceof NotServingRegionException && !retrying) {
>     // Purge cache entries for this specific region from hbase:meta cache
>     // since we don't call connect(true) when number of retries is 1.
>     getConnection().deleteCachedRegionLocation(location);
>   }
> }
> {code}
> But the call is not retried. It throws a SocketTimeoutException because the
> estimated call duration is larger than the operation timeout. See
> RpcRetryingCaller.java#152:
> {code}
> expectedSleep = callable.sleep(pause, tries + 1);
> // If, after the planned sleep, there won't be enough time left, we stop now.
> long duration = singleCallDuration(expectedSleep);
> if (duration > callTimeout) {
>   String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
>       ": " + callable.getExceptionMessageAdditionalDetail();
>   throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
> }
> {code}
> As a result, the wrong region location is never cleaned up.
> [~lhofhansl]
> In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration defaults to 2000,
> which triggers this bug.
> {code}
> private long singleCallDuration(final long expectedSleep) {
>   return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
>       + MIN_RPC_TIMEOUT + expectedSleep;
> }
> {code}
> The master code carries the same risk.
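
To make the arithmetic in the description concrete, here is a minimal standalone sketch of the duration check using the quoted config values. It is not HBase code: the class and method names are hypothetical, the elapsed time is assumed to be 0 ms (the best case), and the first planned sleep is assumed to equal hbase.client.pause (the real backoff may differ).

```java
public class TimeoutCheckSketch {

    // Mirrors the quoted singleCallDuration formula:
    // time already spent + assumed minimum RPC time + planned sleep.
    static long singleCallDuration(long elapsedMs, long minRpcTimeoutMs, long expectedSleepMs) {
        return elapsedMs + minRpcTimeoutMs + expectedSleepMs;
    }

    // True when the retry loop would throw SocketTimeoutException instead of retrying.
    static boolean givesUp(long elapsedMs, long minRpcTimeoutMs,
                           long expectedSleepMs, long callTimeoutMs) {
        return singleCallDuration(elapsedMs, minRpcTimeoutMs, expectedSleepMs) > callTimeoutMs;
    }

    public static void main(String[] args) {
        long callTimeout = 1200; // hbase.client.operation.timeout from the description
        long pause = 200;        // hbase.client.pause, assumed to be the first sleep
        long elapsed = 0;        // assumption: the first call failed instantly

        // With MIN_RPC_TIMEOUT = 2000: 0 + 2000 + 200 = 2200 > 1200, so no retry happens.
        System.out.println(givesUp(elapsed, 2000, pause, callTimeout)); // prints true

        // Without MIN_RPC_TIMEOUT: 0 + 0 + 200 = 200 <= 1200, so the call is retried.
        System.out.println(givesUp(elapsed, 0, pause, callTimeout));    // prints false
    }
}
```

Since MIN_RPC_TIMEOUT alone (2000) already exceeds the operation timeout (1200), the check fails on the first attempt regardless of the sleep value, which is why the retry that would refresh the location never runs.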