[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110601#comment-17110601
 ] 

Andrey Elenskiy commented on HBASE-22041:
-----------------------------------------

Here's the error that shows that the address is getting readded to failed 
servers:
{code:java}
2020-05-18 17:52:20,249 TRACE [RSProcedureDispatcher-pool3-t133] 
procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-18 17:52:20,249 TRACE [RSProcedureDispatcher-pool3-t133] 
ipc.NettyRpcConnection: Connecting to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
2020-05-18 17:52:20,254 TRACE [RS-EventLoopGroup-1-1] ipc.AbstractRpcClient: 
Call: ExecuteProcedures, callTime: 5ms
2020-05-18 17:52:20,254 DEBUG [RS-EventLoopGroup-1-1] ipc.FailedServers: Added 
failed server with address 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 to list caused 
by 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
2020-05-18 17:52:20,254 DEBUG [RSProcedureDispatcher-pool3-t133] 
procedure.RSProcedureDispatcher: request to 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed, 
try=1480
2020-05-18 17:52:20,255 WARN [RSProcedureDispatcher-pool3-t133] 
procedure.RSProcedureDispatcher: request to server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1589824187906 failed due to 
java.net.ConnectException: Call to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on 
connection exception: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020, try=1480, 
retrying...
 ... 10 more
Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host
 ... 6 more
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:644)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:667)
 at 
org.apache.hbase.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:269)
 at 
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors.throwConnectException(Errors.java:112)
Caused by: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020
 at java.lang.Thread.run(Thread.java:748)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:328)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
 at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:631)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:474)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:495)
 at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:261)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:267)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
 at 
org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
 at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:924)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:312)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:326)
 at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1426)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:304)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:312)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:326)
 at 
org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92)
 at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
 at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
 at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:177)
java.net.ConnectException: Call to 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020 failed on 
connection exception: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: No route to host: 
regionserver-1.hbase.hbase.svc.cluster.local/10.128.9.13:16020{code}
These errors happen every 2 seconds (which is the default expiry of failed 
servers)

> The crashed node exists in onlineServer forever, and if it holds the meta 
> data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, normal.zip
>
>
> while master fresh boot, we  crash (kill- 9) the RS who hold meta. we find 
> that the master startup fails and print  thounds of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to