Zheng Hu created HBASE-22381:
--------------------------------

             Summary: The write request won't refresh its HConnection's local meta cache once an RegionServer got stuck
                 Key: HBASE-22381
                 URL: https://issues.apache.org/jira/browse/HBASE-22381
             Project: HBase
          Issue Type: Bug
            Reporter: Zheng Hu
            Assignee: Zheng Hu

In a production environment (provided by [~xinxin fan] from Netease, HBase version: 1.2.6), we found the following case:

1. A RegionServer got stuck.
2. All requests were write requests, and they threw an exception like this:
{code}
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
{code}
3. The write requests to the stuck region server never cleared their client's local meta cache and kept being sent to the stuck server endlessly, which kept availability below 100% for a long time.

I checked the code and found the following in our AsyncRequestFutureImpl#receiveGlobalFailure:
{code}
private void receiveGlobalFailure(
    //....
    updateCachedLocations(server, regionName, row,
        ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
    //....
}
{code}

isMetaClearingException does not consider SocketTimeoutException.
{code}
public static boolean isMetaClearingException(Throwable cur) {
  cur = findException(cur);

  if (cur == null) {
    return true;
  }
  return !isSpecialException(cur) || (cur instanceof RegionMovedException)
      || cur instanceof NotServingRegionException;
}

public static boolean isSpecialException(Throwable cur) {
  return (cur instanceof RegionMovedException
      || cur instanceof RegionOpeningException
      || cur instanceof RegionTooBusyException
      || cur instanceof RpcThrottlingException
      || cur instanceof MultiActionResultTooLarge
      || cur instanceof RetryImmediatelyException
      || cur instanceof CallQueueTooBigException
      || cur instanceof CallDroppedException
      || cur instanceof NotServingRegionException
      || cur instanceof RequestTooBigException);
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
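
The timeout in the stack trace above arrives as a nested "Caused by", so any client-side check for it has to walk the cause chain. Below is a minimal, self-contained sketch of such a check in plain Java. It is only an illustration: the class TimeoutCauseCheck and the helper causedBySocketTimeout are hypothetical names, not part of HBase, and this is not the change made for this issue.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not an HBase class.
class TimeoutCauseCheck {

  // Guard against pathological (cyclic) cause chains.
  private static final int MAX_DEPTH = 32;

  /**
   * Returns true if t, or any exception in its cause chain,
   * is a java.net.SocketTimeoutException.
   */
  public static boolean causedBySocketTimeout(Throwable t) {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_DEPTH; cur = cur.getCause()) {
      if (cur instanceof SocketTimeoutException) {
        return true;
      }
      depth++;
    }
    return false;
  }

  public static void main(String[] args) {
    // A timeout wrapped in another IOException, as in the trace above.
    Throwable wrapped = new IOException("call failed",
        new SocketTimeoutException("15000 millis timeout"));
    System.out.println(causedBySocketTimeout(wrapped));                  // prints true
    System.out.println(causedBySocketTimeout(new IOException("other"))); // prints false
  }
}
```

A caller could use such a predicate to decide whether a failed write should also drop the cached region location. Whether every socket timeout should clear the cache is a separate design question that this sketch does not answer.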