Zheng Hu created HBASE-22381:
--------------------------------

             Summary: The write request won't refresh its HConnection's local 
meta cache once an RegionServer got stuck
                 Key: HBASE-22381
                 URL: https://issues.apache.org/jira/browse/HBASE-22381
             Project: HBase
          Issue Type: Bug
            Reporter: Zheng Hu
            Assignee: Zheng Hu


In a production environment (reported by [~xinxin fan] from Netease, HBase 
version: 1.2.6), we found the following case: 
1. A RegionServer got stuck;
2. All requests were write requests, and each threw an exception like this: 
{code}
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
{code}
3. Write requests to the stuck RegionServer never cleared the client's 
local meta cache and kept being sent to the stuck server endlessly, which kept 
availability below 100% for a long time.

I checked the code and found the following in our 
AsyncRequestFutureImpl#receiveGlobalFailure: 

{code}
  private void receiveGlobalFailure(
     //....
      updateCachedLocations(server, regionName, row,
        ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
     //....
   }
{code}

isMetaClearingException does not take SocketTimeoutException into account.

{code}
  public static boolean isMetaClearingException(Throwable cur) {
    cur = findException(cur);

    if (cur == null) {
      return true;
    }
    return !isSpecialException(cur) || (cur instanceof RegionMovedException)
        || cur instanceof NotServingRegionException;
  }

  public static boolean isSpecialException(Throwable cur) {
    return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
        || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
        || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
        || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
        || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
  }
{code}
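
For illustration only, here is a minimal sketch of how a check could recognize a SocketTimeoutException explicitly, including when it is wrapped as a cause (the trace above starts with "Caused by:"). The class TimeoutCauseCheck and the method hasSocketTimeoutCause are hypothetical names that do not exist in HBase, and this is not the fix proposed for this issue. It uses only JDK classes.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class TimeoutCauseCheck {

    // Hypothetical helper, not part of HBase: returns true if the throwable
    // or any of its causes is a SocketTimeoutException.
    public static boolean hasSocketTimeoutCause(Throwable t) {
        int depth = 0;
        // The depth bound guards against accidental cycles in the cause chain.
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A timeout wrapped in another exception, similar in shape to the trace above.
        Throwable wrapped = new IOException("call failed",
            new SocketTimeoutException("15000 millis timeout"));
        System.out.println(hasSocketTimeoutCause(wrapped));
        System.out.println(hasSocketTimeoutCause(new IOException("unrelated failure")));
    }
}
```

Whether such a check belongs in isMetaClearingException, in receiveGlobalFailure, or elsewhere is an open design question that this sketch does not answer.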



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
