[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111654#comment-17111654 ]
Michael Stack commented on HBASE-22041:
---------------------------------------
Thanks for the detail [~timoha]. Haven't looked at code yet.
We first report it as being in the failed server list and then we start getting
'No route to host'. Does that start after the container comes back with a new IP
(though the IP looks the same across the pasted log snippets)? What are the DNS
cache timeouts on this host (networkaddress.cache.ttl)? We should give up on
'No route to host', for sure.
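The TTL question can be answered on the master host with a small check. This is a minimal sketch that assumes a standard JDK. `networkaddress.cache.ttl` (successful lookups) and `networkaddress.cache.negative.ttl` (failed lookups) are `java.security` properties with values in seconds, and both are read through `java.security.Security`. The class name `DnsTtlCheck` is made up for illustration.

```java
import java.security.Security;

public class DnsTtlCheck {
    // Returns the configured value, or a marker when the property is unset
    // (in that case the JDK's implementation-specific default applies).
    static String describe(String key) {
        String value = Security.getProperty(key);
        return value == null ? "unset (JDK default)" : value;
    }

    public static void main(String[] args) {
        // How long successful lookups are cached, in seconds (-1 = forever).
        System.out.println("networkaddress.cache.ttl = "
                + describe("networkaddress.cache.ttl"));
        // How long failed lookups are cached, in seconds.
        System.out.println("networkaddress.cache.negative.ttl = "
                + describe("networkaddress.cache.negative.ttl"));
    }
}
```

If a container can come back with a different IP under the same hostname, a long or infinite positive TTL would keep the JVM resolving to the stale address. The value can be changed in the JDK's `java.security` file, or with `Security.setProperty(...)`. The programmatic route generally has to run early in startup, before the first name lookup.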
> The crashed node exists in onlineServer forever, and if it holds the meta
> data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------
>
> Key: HBASE-22041
> URL: https://issues.apache.org/jira/browse/HBASE-22041
> Project: HBase
> Issue Type: Bug
> Reporter: lujie
> Priority: Critical
> Attachments: bug.zip, normal.zip
>
>
> During a fresh master boot, we crash (kill -9) the RS that holds meta. We find
> that master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to
> hadoop14/172.16.1.131:16020 failed on connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> syscall:getsockopt(..) failed: Connection refused:
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9]
> procedure.RSProcedureDispatcher: request to server
> hadoop14,16020,1552410583724 failed due to
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to
> hadoop14/172.16.1.131:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}
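The quoted log shows the dispatcher retrying the same dead server roughly every 100 ms, and the reporter sees thousands of such lines. The comment above suggests giving up on 'No route to host'. The following is a minimal sketch of that policy: stop immediately when a `java.net.NoRouteToHostException` appears anywhere in the cause chain, and otherwise cap the number of attempts. `BoundedRetry`, `shouldGiveUp`, `maxAttempts` and `backoffMs` are hypothetical names. This is not HBase's `RSProcedureDispatcher` retry logic.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.concurrent.Callable;

public class BoundedRetry {
    // Hypothetical policy: returns true when we should stop retrying this server.
    static boolean shouldGiveUp(Throwable t, int attempt, int maxAttempts) {
        // Walk the cause chain, since RPC layers usually wrap the socket exception.
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof NoRouteToHostException) {
                return true; // host unreachable: retrying the same address is pointless
            }
        }
        // Any other IOException (e.g. connection refused) is retried up to the cap.
        return attempt + 1 >= maxAttempts;
    }

    // Runs op, retrying on IOException with a fixed backoff until the policy gives up.
    static <T> T call(Callable<T> op, int maxAttempts, long backoffMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                if (shouldGiveUp(e, attempt, maxAttempts)) {
                    throw e;
                }
                Thread.sleep(backoffMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        try {
            call(() -> {
                calls[0]++;
                throw new IOException("Call failed",
                        new NoRouteToHostException("No route to host"));
            }, 10, 1);
        } catch (IOException e) {
            // Prints "gave up after 1 attempt(s)": no retry on 'No route to host'.
            System.out.println("gave up after " + calls[0] + " attempt(s)");
        }
    }
}
```

A cap alone would still let the master wait on a dead meta server for `maxAttempts * backoffMs`. The early exit is what turns an unreachable host into a prompt failure that a caller could act on (for example, by expiring the server).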