[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114453#comment-17114453 ]
Duo Zhang commented on HBASE-22041:
-----------------------------------
When I implemented an in-house RPC framework about ten years ago, my solution
was to use an unresolved ISA (InetSocketAddress) as the key for each RPC
connection, and to create a new, resolved one whenever we actually wanted to
connect to the remote peer.
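
A minimal sketch of that idea, not the code from that framework and not
HBase's actual implementation. The class name UnresolvedKeyDemo, the helper
names keyFor and resolveForConnect, and the String stand-in for a connection
object are all hypothetical. It relies only on java.net.InetSocketAddress:
unresolved addresses compare by hostname and port, so the map key does not
depend on whichever IP the hostname resolved to when the key was created.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UnresolvedKeyDemo {
    // Hypothetical helper: build the connection key without any DNS lookup.
    public static InetSocketAddress keyFor(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    // Hypothetical helper: build a fresh address right before connecting.
    // This constructor attempts name resolution each time it is called;
    // if the lookup fails, the result is still flagged as unresolved.
    public static InetSocketAddress resolveForConnect(InetSocketAddress key) {
        return new InetSocketAddress(key.getHostString(), key.getPort());
    }

    public static void main(String[] args) {
        // Stand-in for a connection pool; the String values are placeholders.
        Map<InetSocketAddress, String> pool = new ConcurrentHashMap<>();

        InetSocketAddress key = keyFor("localhost", 16020);
        pool.putIfAbsent(key, "connection-1");
        // An equal key (same hostname and port) maps to the same entry.
        pool.putIfAbsent(keyFor("localhost", 16020), "connection-2");

        System.out.println("key unresolved: " + key.isUnresolved());
        System.out.println("pool size: " + pool.size());

        // Resolve only at connect time, so a changed IP is picked up here.
        InetSocketAddress target = resolveForConnect(key);
        System.out.println("target unresolved: " + target.isUnresolved());
    }
}
```

Whether a changed IP is actually seen at connect time also depends on the
JVM's DNS cache settings, which this sketch does not touch.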
> [k8s] The crashed node exists in onlineServer forever, and if it holds the
> meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
> Issue Type: Bug
> Reporter: lujie
> Priority: Critical
> Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> While the master is booting fresh, we crash (kill -9) the RS that holds
> meta. We find that the master startup fails and prints thousands of log
> lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to
> hadoop14/172.16.1.131:16020 failed on connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> syscall:getsockopt(..) failed: Connection refused:
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}