[
https://issues.apache.org/jira/browse/HBASE-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220803#comment-14220803
]
Hadoop QA commented on HBASE-12534:
-----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12682857/HBASE-12534-0.94-v1.diff
against master branch at commit 325cdc0987f8176ac46695f5b0c93b0fc6605ab9.
ATTACHMENT ID: 12682857
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include
any new or modified tests.
Please justify why no new tests are needed for this
patch.
Also please list what manual steps were performed to
verify this patch.
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/11776//console
This message is automatically generated.
> Wrong region location cache in client after regions are moved
> -------------------------------------------------------------
>
> Key: HBASE-12534
> URL: https://issues.apache.org/jira/browse/HBASE-12534
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Liu Shaohui
> Assignee: Liu Shaohui
> Priority: Critical
> Labels: client
> Attachments: HBASE-12534-0.94-v1.diff, HBASE-12534-v1.diff
>
>
> In our 0.94 hbase cluster, we found that the client cached a wrong region
> location and did not update it after the region was moved to another
> regionserver.
> The cause is a wrong client config combined with a bug in RpcRetryingCaller
> of the hbase client.
> The rpc configs are as follows:
> {code}
> hbase.rpc.timeout=1000
> hbase.client.pause=200
> hbase.client.operation.timeout=1200
> {code}
> But the client retry number is 3:
> {code}
> hbase.client.retries.number=3
> {code}
> Assume that a region was on regionserver A and was then moved to
> regionserver B. The client tries to make a call to regionserver A and gets a
> NotServingRegionException. Because the retry number is not 1, the cached
> region location is not cleaned. See RpcRetryingCaller.java#141 and
> RegionServerCallable.java#127:
> {code}
> @Override
> public void throwable(Throwable t, boolean retrying) {
>   if (t instanceof SocketTimeoutException ||
>       ....
>   } else if (t instanceof NotServingRegionException && !retrying) {
>     // Purge cache entries for this specific region from hbase:meta cache
>     // since we don't call connect(true) when number of retries is 1.
>     getConnection().deleteCachedRegionLocation(location);
>   }
> }
> {code}
> But the call is not retried; it throws a SocketTimeoutException because the
> estimated time for the call is larger than the operation timeout. See
> RpcRetryingCaller.java#152:
> {code}
> expectedSleep = callable.sleep(pause, tries + 1);
> // If, after the planned sleep, there won't be enough time left, we stop now.
> long duration = singleCallDuration(expectedSleep);
> if (duration > callTimeout) {
>   String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
>       ": " + callable.getExceptionMessageAdditionalDetail();
>   throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
> }
> {code}
> As a result, the wrong region location is never cleaned up.
> [~lhofhansl]
> In hbase 0.94, MIN_RPC_TIMEOUT in singleCallDuration defaults to 2000,
> which triggers this bug.
> {code}
> private long singleCallDuration(final long expectedSleep) {
>   return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
>       + MIN_RPC_TIMEOUT + expectedSleep;
> }
> {code}
> But the master code is at risk too.
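The arithmetic behind the description above can be checked with a minimal sketch. It is not the HBase source: the class RetryBudgetSketch and the method abortsBeforeRetry are hypothetical names, the constants are the values quoted in the report, and it assumes the first planned sleep equals hbase.client.pause and that no time has elapsed yet (the most favorable case for a retry).

```java
// Hypothetical sketch of the retry-budget check described in the report.
// Names mirror the quoted snippets, but this is not the HBase implementation.
class RetryBudgetSketch {
    // Values taken from the report.
    static final long MIN_RPC_TIMEOUT_MS = 2000;   // 0.94 default, per the report
    static final long OPERATION_TIMEOUT_MS = 1200; // hbase.client.operation.timeout
    static final long PAUSE_MS = 200;              // hbase.client.pause

    // Same shape as the quoted singleCallDuration: elapsed time so far,
    // plus the minimum rpc timeout, plus the planned sleep.
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT_MS + expectedSleepMs;
    }

    // True when the caller would throw SocketTimeoutException instead of retrying.
    static boolean abortsBeforeRetry(long elapsedMs, long expectedSleepMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) > OPERATION_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        // Assumption: zero elapsed time and a first sleep equal to the pause.
        long duration = singleCallDuration(0, PAUSE_MS);
        System.out.println("estimated duration = " + duration + " ms");
        System.out.println("aborts before retry = " + abortsBeforeRetry(0, PAUSE_MS));
    }
}
```

Under these assumptions the estimate exceeds the operation timeout even with zero elapsed time and zero sleep, so with an operation timeout below MIN_RPC_TIMEOUT the check fails on the first attempt. Because the retry number is 3, the NotServingRegionException branch does not purge the cache either, and the stale location survives.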
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)