Zheng Hu created HBASE-22381:
--------------------------------
Summary: The write request won't refresh its HConnection's local
meta cache once a RegionServer gets stuck
Key: HBASE-22381
URL: https://issues.apache.org/jira/browse/HBASE-22381
Project: HBase
Issue Type: Bug
Reporter: Zheng Hu
Assignee: Zheng Hu
In a production environment (reported by [~xinxin fan] from Netease, HBase
version: 1.2.6), we found the following case:
1. A RegionServer got stuck;
2. All requests were write requests, and each threw an exception like this:
{code}
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
{code}
3. Write requests to the stuck RegionServer never cleared the client's
local meta cache, so the client kept sending requests to the stuck server
endlessly, which kept availability below 100% for a long time.
I checked the code and found the following in
AsyncRequestFutureImpl#receiveGlobalFailure:
{code}
private void receiveGlobalFailure(...) {
  //....
  updateCachedLocations(server, regionName, row,
      ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
  //....
}
{code}
isMetaClearingException does not take SocketTimeoutException into account:
{code}
public static boolean isMetaClearingException(Throwable cur) {
  cur = findException(cur);
  if (cur == null) {
    return true;
  }
  return !isSpecialException(cur) || (cur instanceof RegionMovedException)
      || cur instanceof NotServingRegionException;
}

public static boolean isSpecialException(Throwable cur) {
  return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
      || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
      || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
      || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
      || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
}
{code}
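To check which value the ternary in receiveGlobalFailure hands to updateCachedLocations for a given exception, the quoted logic can be exercised outside HBase. The following is only a minimal sketch, not HBase code: the exception classes are local stand-ins, isSpecialException is abbreviated to four of the listed types, the helper exceptionPassedToCacheUpdate is a hypothetical name, and findException (which is not shown in this issue) is replaced by an assumed, simplified walk of the cause chain, so results for wrapped exceptions may differ from the real client.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Stand-in sketch: the nested exception types are local placeholders,
// NOT the real org.apache.hadoop.hbase classes.
class MetaClearingSketch {
  static class RegionMovedException extends Exception {}
  static class NotServingRegionException extends Exception {}
  static class RegionTooBusyException extends Exception {}
  static class CallQueueTooBigException extends Exception {}

  // Abbreviated form of the quoted isSpecialException (four of the ten types).
  static boolean isSpecialException(Throwable cur) {
    return cur instanceof RegionMovedException
        || cur instanceof NotServingRegionException
        || cur instanceof RegionTooBusyException
        || cur instanceof CallQueueTooBigException;
  }

  // ASSUMPTION: findException is not shown in the issue. This simplified
  // version returns the first special exception found in the cause chain,
  // or the innermost cause if none is special.
  static Throwable findException(Throwable t) {
    Throwable cur = t;
    while (cur != null) {
      if (isSpecialException(cur) || cur.getCause() == null) {
        return cur;
      }
      cur = cur.getCause();
    }
    return null;
  }

  // Same boolean logic as the quoted isMetaClearingException.
  static boolean isMetaClearingException(Throwable cur) {
    cur = findException(cur);
    if (cur == null) {
      return true;
    }
    return !isSpecialException(cur) || (cur instanceof RegionMovedException)
        || cur instanceof NotServingRegionException;
  }

  // Hypothetical helper mirroring the ternary in receiveGlobalFailure:
  // the value that would be passed on to updateCachedLocations.
  static Throwable exceptionPassedToCacheUpdate(Throwable t) {
    return isMetaClearingException(t) ? null : t;
  }

  public static void main(String[] args) {
    Throwable[] samples = {
      new RegionTooBusyException(),
      new NotServingRegionException(),
      new SocketTimeoutException("15000 millis timeout"),
      new IOException(new CallQueueTooBigException())
    };
    for (Throwable t : samples) {
      System.out.println(t.getClass().getSimpleName()
          + " -> metaClearing=" + isMetaClearingException(t)
          + ", passedToCacheUpdate=" + exceptionPassedToCacheUpdate(t));
    }
  }
}
```

Running main prints the classification for each sample, including a bare SocketTimeoutException. How updateCachedLocations treats the value it receives is not shown here and has to be checked against the HBase version in question.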
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)