Liu Shaohui created HBASE-12534:
-----------------------------------
Summary: Wrong region location cache in client after regions are moved
Key: HBASE-12534
URL: https://issues.apache.org/jira/browse/HBASE-12534
Project: HBase
Issue Type: Bug
Affects Versions: 0.94.24
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Critical
In our 0.94 HBase cluster, we found that the client cached a wrong region location
and did not update it after the region was moved to another regionserver.
The cause is a combination of a bad client config and a bug in RpcRetryingCaller in the HBase client.
The RPC configs are as follows:
{code}
hbase.rpc.timeout=1000
hbase.client.pause=200
hbase.client.operation.timeout=1200
{code}
The client retry number, however, is 3:
{code}
hbase.client.retries.number=3
{code}
Assume that a region was on regionserver A and was then moved to
regionserver B. The client tries to make a call to regionserver A and gets a
NotServingRegionException. Because the retry number is not 1, the cached region
location is not cleared. See RpcRetryingCaller.java#141 and
RegionServerCallable.java#127:
{code}
@Override
public void throwable(Throwable t, boolean retrying) {
  if (t instanceof SocketTimeoutException ||
      ....
  } else if (t instanceof NotServingRegionException && !retrying) {
    // Purge cache entries for this specific region from hbase:meta cache
    // since we don't call connect(true) when number of retries is 1.
    getConnection().deleteCachedRegionLocation(location);
  }
}
{code}
However, the call is not retried. Instead it throws a SocketTimeoutException,
because the estimated duration of the call is larger than the operation timeout.
See RpcRetryingCaller.java#152:
{code}
expectedSleep = callable.sleep(pause, tries + 1);
// If, after the planned sleep, there won't be enough time left, we stop now.
long duration = singleCallDuration(expectedSleep);
if (duration > callTimeout) {
  String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
      ": " + callable.getExceptionMessageAdditionalDetail();
  throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
}
{code}
As a result, the wrong region location is never cleaned up.
[~lhofhansl]
In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default. That alone
exceeds the 1200 ms operation timeout configured above, which triggers this bug.
{code}
private long singleCallDuration(final long expectedSleep) {
  return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
      + MIN_RPC_TIMEOUT + expectedSleep;
}
{code}
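To make the arithmetic concrete, here is a minimal standalone sketch of the two checks described above. It is not the HBase code: the class and method names are hypothetical, the `retrying = (retries != 1)` rule is an assumption taken from the description in this report, and the planned sleep is assumed to be at least hbase.client.pause.

```java
// Hypothetical model of the failure path described in this issue.
// None of these names exist in HBase; the constants come from the report.
final class StaleLocationSketch {
    // Default quoted in the report for HBase 0.94.
    static final long MIN_RPC_TIMEOUT = 2000;

    // Same formula as singleCallDuration: elapsed time so far,
    // plus the minimum RPC budget, plus the planned sleep.
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT + expectedSleepMs;
    }

    // Assumption: the caller passes retrying = (retries != 1), and the
    // location is purged on NotServingRegionException only when !retrying.
    static boolean cachePurgedOnNsre(int retries) {
        boolean retrying = retries != 1;
        return !retrying;
    }

    // True when the caller gives up with SocketTimeoutException instead of retrying.
    static boolean abortsBeforeRetry(long elapsedMs, long expectedSleepMs, long callTimeoutMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) > callTimeoutMs;
    }

    // The stale entry survives when it was not purged AND no retry happens.
    static boolean staleLocationSurvives(int retries, long elapsedMs,
                                         long expectedSleepMs, long callTimeoutMs) {
        return !cachePurgedOnNsre(retries)
            && abortsBeforeRetry(elapsedMs, expectedSleepMs, callTimeoutMs);
    }

    public static void main(String[] args) {
        int retries = 3;          // hbase.client.retries.number
        long pause = 200;         // hbase.client.pause, used as a lower bound for the sleep
        long callTimeout = 1200;  // hbase.client.operation.timeout
        long elapsed = 0;         // best case: the first call failed instantly

        System.out.println("estimated duration = " + singleCallDuration(elapsed, pause));
        System.out.println("stale location survives = "
            + staleLocationSurvives(retries, elapsed, pause, callTimeout));
    }
}
```

Even with zero elapsed time the estimate is 2000 + 200 = 2200 ms, which is above the 1200 ms operation timeout, so under these assumptions the first NotServingRegionException always ends in a SocketTimeoutException with the cached location left in place.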
The same risk exists in the master code too.