Liu Shaohui created HBASE-12534:
-----------------------------------

             Summary: Wrong region location cache in client after regions are moved
                 Key: HBASE-12534
                 URL: https://issues.apache.org/jira/browse/HBASE-12534
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.94.24
            Reporter: Liu Shaohui
            Assignee: Liu Shaohui
            Priority: Critical


In our 0.94 HBase cluster, we found that a client kept a wrong region location 
in its cache and did not update it after the region was moved to another 
regionserver.
The cause is a combination of a wrong client config and a bug in 
RpcRetryingCaller in the HBase client.
The RPC configs are as follows:
{code}
hbase.rpc.timeout=1000
hbase.client.pause=200
hbase.client.operation.timeout=1200
{code}
However, the client retry number is 3:
{code}
hbase.client.retries.number=3
{code}
Assume that a region was on regionserver A and is then moved to regionserver B. 
The client tries to make a call to regionserver A and gets a 
NotServingRegionException. Because the retry number is not 1, the cached 
region location is not cleared. See RpcRetryingCaller.java#141 and 
RegionServerCallable.java#127:
{code}
  @Override
  public void throwable(Throwable t, boolean retrying) {
    if (t instanceof SocketTimeoutException ||
      ....
    } else if (t instanceof NotServingRegionException && !retrying) {
      // Purge cache entries for this specific region from hbase:meta cache
      // since we don't call connect(true) when number of retries is 1.
      getConnection().deleteCachedRegionLocation(location);
    }
  }
{code}
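
The effect of the `!retrying` guard can be illustrated with a minimal, self-contained sketch. This is not the HBase code: the class, the map-based cache, and the method names are hypothetical stand-ins, and the assumption that `retrying` is true whenever the retry number is not 1 is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

class StaleLocationSketch {
    // Hypothetical stand-in for the client's region location cache.
    static final Map<String, String> cache = new HashMap<>();

    // Mirrors only the NotServingRegionException branch quoted above:
    // the cached entry is purged only when the caller is NOT retrying.
    static void onNotServingRegion(String region, boolean retrying) {
        if (!retrying) {
            cache.remove(region);
        }
    }

    // Returns true if the stale entry is still cached after one failed call.
    static boolean staleEntrySurvives(int retries) {
        cache.clear();
        cache.put("region-1", "regionserver-A"); // region has since moved to B
        boolean retrying = retries != 1;          // assumption, per the report
        onNotServingRegion("region-1", retrying);
        return cache.containsKey("region-1");
    }

    public static void main(String[] args) {
        System.out.println("retries=3, stale entry kept: " + staleEntrySurvives(3));
        System.out.println("retries=1, stale entry kept: " + staleEntrySurvives(1));
    }
}
```

With a retry number of 3 the entry stays in the cache, on the expectation that a later retry will refresh it. The rest of this report explains why that retry never happens.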
However, the call is never retried. Instead, a SocketTimeoutException is thrown, 
because the time the call is expected to take is larger than the operation 
timeout. See RpcRetryingCaller.java#152:
{code}
        expectedSleep = callable.sleep(pause, tries + 1);

        // If, after the planned sleep, there won't be enough time left, we stop now.
        long duration = singleCallDuration(expectedSleep);
        if (duration > callTimeout) {
          String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
              ": " + callable.getExceptionMessageAdditionalDetail();
          throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
        }
{code}

As a result, the wrong region location is never cleaned up.

[~lhofhansl]
In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, 
which triggers this bug.
{code}
  private long singleCallDuration(final long expectedSleep) {
    return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
      + MIN_RPC_TIMEOUT + expectedSleep;
  }
{code}
But the same risk exists in the master code too.
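
To make the arithmetic concrete, here is a small standalone sketch (not the HBase source) that plugs the config values from this report into the same formula. The class and helper names are hypothetical, and the expected sleep of 200 ms is an assumption equal to hbase.client.pause. Any non-negative sleep gives the same outcome.

```java
class RetryBudgetSketch {
    // 0.94 default, as stated in this report.
    static final long MIN_RPC_TIMEOUT = 2000;

    // Same shape as singleCallDuration above, with elapsed time passed in
    // explicitly instead of being read from the clock.
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT + expectedSleepMs;
    }

    // True when the caller would throw SocketTimeoutException instead of retrying.
    static boolean givesUpBeforeRetry(long elapsedMs, long expectedSleepMs,
                                      long callTimeoutMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) > callTimeoutMs;
    }

    public static void main(String[] args) {
        long callTimeout = 1200;   // hbase.client.operation.timeout
        long expectedSleep = 200;  // assumed: one hbase.client.pause interval
        long elapsed = 0;          // best case: the first call failed instantly

        System.out.println("duration=" + singleCallDuration(elapsed, expectedSleep));
        System.out.println("gives up=" + givesUpBeforeRetry(elapsed, expectedSleep, callTimeout));
    }
}
```

Even in the best case the estimated duration is 2200 ms, which exceeds the 1200 ms operation timeout. Under these assumptions, the check fails on the first attempt whenever the operation timeout is below MIN_RPC_TIMEOUT.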



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
