[jira] [Commented] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

Andrey Elenskiy (Jira) Thu, 21 May 2020 10:39:08 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113423#comment-17113423
 ]


Andrey Elenskiy commented on HBASE-22041:
-----------------------------------------

Attached the entire hbasemaster log (hbasemaster.log), with TRACE enabled right 
before trying to reproduce the issue.

I triggered the issue at "Thu May 21 17:28:42 UTC 2020". At that point the 
topology looked like this:
{noformat}
hbasemaster-0 10.128.25.30
hbasemaster-1 10.128.6.51 
regionserver-0 10.128.53.53
regionserver-1 10.128.9.37 
regionserver-2 10.128.14.39{noformat}

The way I trigger the issue is by picking a regionserver with 0 regions 
(because it was restarted recently), running "balancer", and killing that 
regionserver while the balancer is executing. In this case the regionserver I 
killed was regionserver-2. Here's what the topology looked like after 
regionserver-2 came back up:

{noformat}
hbasemaster-0 10.128.25.30
hbasemaster-1 10.128.6.51 
regionserver-0 10.128.53.53
regionserver-1 10.128.9.37 
regionserver-2 10.128.14.40{noformat}
You can see that regionserver-2 came back up with IP 10.128.14.40, but 
hbasemaster is still trying to contact the old address, 10.128.14.39.
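
To make the change easy to spot, here is a minimal sketch (not HBase code) that 
diffs the two topology snapshots above and reports which host changed address. 
The class name TopologyDiff and the method changedAddresses are hypothetical, 
introduced only for illustration; the host names and IPs are the ones listed in 
this comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TopologyDiff {
    // Hypothetical helper: returns host -> "oldIp -> newIp" for every host
    // present in both snapshots whose IP differs between them.
    static Map<String, String> changedAddresses(Map<String, String> before,
                                                Map<String, String> after) {
        Map<String, String> changed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : before.entrySet()) {
            String newIp = after.get(e.getKey());
            if (newIp != null && !newIp.equals(e.getValue())) {
                changed.put(e.getKey(), e.getValue() + " -> " + newIp);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Snapshot taken before regionserver-2 was killed.
        Map<String, String> before = new LinkedHashMap<>();
        before.put("hbasemaster-0", "10.128.25.30");
        before.put("hbasemaster-1", "10.128.6.51");
        before.put("regionserver-0", "10.128.53.53");
        before.put("regionserver-1", "10.128.9.37");
        before.put("regionserver-2", "10.128.14.39");

        // Snapshot taken after regionserver-2 came back up.
        Map<String, String> after = new LinkedHashMap<>(before);
        after.put("regionserver-2", "10.128.14.40");

        // prints {regionserver-2=10.128.14.39 -> 10.128.14.40}
        System.out.println(changedAddresses(before, after));
    }
}
```

Only regionserver-2 shows up in the result: the hostname is unchanged and only 
the IP behind it moved.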


> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> While the master is doing a fresh boot, we crash (kill -9) the RS that holds 
> meta. We find that the master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
