[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118159#comment-17118159
 ] 

Michael Stack commented on HBASE-22041:
---------------------------------------

The TRACE-level logs really help. Thanks [~timoha]

It looks like sun.net.inetaddr.ttl ==  networkaddress.cache.ttl.

If it is set to 10 seconds, that might be too long, judging by the log.
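
For reference, the two properties behave the same way in the JDK; here is a minimal sketch (plain JDK behavior, not HBase code) of setting the equivalent Security property:

{code:java}
import java.net.InetAddress;
import java.security.Security;

// Plain JDK illustration, not HBase code: the positive-lookup DNS cache TTL can be
// set as a Security property; launching with -Dsun.net.inetaddr.ttl=1 is the
// system-property fallback the JDK consults when the Security property is unset.
// Either has to be in effect before the JVM does its first name lookup.
public class DnsTtlExample {
  public static void main(String[] args) throws Exception {
    Security.setProperty("networkaddress.cache.ttl", "1"); // cache positive lookups for ~1 second
    System.out.println(InetAddress.getByName("localhost").getHostAddress());
  }
}
{code}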

I see this in the log around the time that regionserver-2 came back up:

{code}
 2020-05-21 17:28:55,010 DEBUG [RpcServer.default.FPBQ.Fifo.handler=48,queue=3,port=16000] master.ServerManager: STARTUP: Server regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 came back up, removed it from the dead servers list
{code}

What the above says is that the old server instance, regionserver-2.hbase.hbase.svc.cluster.local,16020,1590081637367, has been replaced in the Master by regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 (note the differing timestamps).
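
The two names share host and port and differ only in the trailing startcode. A quick illustration (not from the logs, just showing the ServerName pieces; assumes hbase-client on the classpath):

{code:java}
import org.apache.hadoop.hbase.ServerName;

// Illustrative only: parse the two names from the log above and show they share
// host+port but differ in startcode, which is how the Master tells the old RS
// instance from the one that came back up.
public class ServerNameExample {
  public static void main(String[] args) {
    ServerName oldRs = ServerName.valueOf("regionserver-2.hbase.hbase.svc.cluster.local,16020,1590081637367");
    ServerName newRs = ServerName.valueOf("regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059");
    boolean sameAddress = oldRs.getHostname().equals(newRs.getHostname()) && oldRs.getPort() == newRs.getPort();
    System.out.println("same address: " + sameAddress
        + ", startcodes: " + oldRs.getStartcode() + " vs " + newRs.getStartcode());
  }
}
{code}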

A little later in the log I see this:

{code}
 2020-05-21 17:28:55,745 DEBUG [RSProcedureDispatcher-pool3-t31] master.ServerManager: New admin connection to regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059
 2020-05-21 17:28:55,745 TRACE [RSProcedureDispatcher-pool3-t31] ipc.NettyRpcConnection: Connecting to regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
...
{code}

i.e. the ServerManager is processing the 'new' server, part of which is setting up a new connection to it.

Logging is a little sparse here, but the code looks like it should do the right thing. The ServerManager asks to get the admin instance, which ends up in here:

{code}
  @Override
  public AdminProtos.AdminService.BlockingInterface getAdmin(ServerName serverName)
      throws IOException {
    checkClosed();
    if (isDeadServer(serverName)) {
      throw new RegionServerStoppedException(serverName + " is dead.");
    }
    String key = getStubKey(AdminProtos.AdminService.BlockingInterface.class.getName(), serverName,
      this.hostnamesCanChange);
    return (AdminProtos.AdminService.BlockingInterface) computeIfAbsentEx(stubs, key, () -> {
      BlockingRpcChannel channel =
          this.rpcClient.createBlockingRpcChannel(serverName, user, rpcTimeout);
      return AdminProtos.AdminService.newBlockingStub(channel);
    });
  }
{code}

The call to getStubKey should be doing a fresh lookup (hostnamesCanChange defaults to true -- see HBASE-14544), but it is happening at ~17:28:55,745.
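
Roughly, the idea behind getStubKey with hostnamesCanChange=true is to put the freshly resolved address into the stub cache key (that is what HBASE-14544 added). Not the verbatim HBase method, just a sketch of the idea; note the resolution still goes through the JDK resolver cache, which is why the TTL above matters:

{code:java}
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Not the verbatim HBase code, only a sketch of the idea behind getStubKey when
// hostnamesCanChange is true (HBASE-14544): resolve the hostname at key-construction
// time so a server that comes back on a new IP gets a new cache key, and hence a new
// stub, instead of reusing the stale one. The resolution itself still goes through
// the JDK resolver cache, so a long TTL can hand back the old pod's address.
public class StubKeySketch {
  static String stubKeyFor(String serviceName, String hostname, int port, boolean hostnamesCanChange) {
    String hostPart = hostname;
    if (hostnamesCanChange) {
      InetAddress address = new InetSocketAddress(hostname, port).getAddress();
      if (address != null) {
        hostPart = address.getHostAddress(); // the freshly resolved IP goes into the key
      }
    }
    return serviceName + "@" + hostPart + ":" + port;
  }

  public static void main(String[] args) {
    System.out.println(stubKeyFor("AdminService", "localhost", 16020, true));
  }
}
{code}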

The new server is registered at the Master at ~17:28:55,010.

When did K8s register the new pod in DNS?

What happens if you run with -Dsun.net.inetaddr.ttl=1 instead of 10?
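
If it helps answer both questions, here is a throwaway probe (not HBase code) you could run in-cluster with the same JVM DNS settings as the Master, to see how long the old address keeps being returned after the pod is replaced:

{code:java}
import java.net.InetAddress;

// Throwaway probe, not HBase code: run in-cluster with the same JVM DNS settings as
// the Master and watch when the name starts resolving to the new pod's IP, i.e.
// whether the delay is the JDK cache TTL or K8s DNS registration itself.
public class ResolveProbe {
  public static void main(String[] args) throws Exception {
    String host = args.length > 0 ? args[0] : "regionserver-2.hbase.hbase.svc.cluster.local";
    for (int i = 0; i < 60; i++) {
      try {
        System.out.println(System.currentTimeMillis() + " " + InetAddress.getByName(host).getHostAddress());
      } catch (Exception e) {
        System.out.println(System.currentTimeMillis() + " lookup failed: " + e.getMessage());
      }
      Thread.sleep(1000L);
    }
  }
}
{code}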







> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> During a fresh master boot, we crash (kill -9) the RS that holds meta. We find that the master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to hadoop14/172.16.1.131:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
