[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705041#comment-17705041 ]

youchan edited comment on HBASE-22041 at 3/26/23 8:15 AM:
----------------------------------------------------------

I ran into the same situation recently. My HBase cluster is deployed on Kubernetes, and when a RegionServer pod is evicted and restarted, the master cannot obtain the RS's new IP address and keeps retrying against the previous pod's address. The master fails to get the correct RS IP for three reasons:
1. After the RS pod is evicted and recreated, it takes time for Kubernetes to update DNS, and more time for other pods to observe the new record.
2. The master's own DNS cache has a TTL.
3. The master never gets a chance to re-resolve the RS's IP after caching a stale address: the IP is resolved only the first time the RS admin stub is created, and every later request fetches the stub directly from the 'adminStubs' map.

To fix this at the root, the master can remove the RS's admin stub from the map whenever a request fails with a ConnectionException, so that the next call resolves the RS's IP again.
{code:java}
  AdminService.Interface getAdminStub(ServerName serverName) throws IOException {
    return ConcurrentMapUtils.computeIfAbsentEx(adminStubs,
      getStubKey(AdminService.getDescriptor().getName(), serverName),
      // the RS's IP is resolved only on the first lookup; later calls hit the cache
      () -> createAdminServerStub(serverName));
  }
{code}
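As a side note on reason 2, the JVM's positive DNS cache honors the standard networkaddress.cache.ttl security property; shortening it limits how long the master keeps a stale RS address. This is a generic workaround sketch with an illustrative TTL value, not part of the PR:

```java
import java.security.Security;

public class ShortDnsTtl {
  public static void main(String[] args) {
    // Cache successful DNS lookups for only 5 seconds (illustrative value,
    // not a recommendation). Must run before the first lookup is cached,
    // i.e. early in process startup.
    Security.setProperty("networkaddress.cache.ttl", "5");
    System.out.println(Security.getProperty("networkaddress.cache.ttl"));
  }
}
```

This only shortens the stale window; it does not remove the need to evict the cached stub, since the 'adminStubs' map never re-resolves at all.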

I have attempted to fix this in [PR|https://github.com/apache/hbase/pull/5138].
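The eviction approach can be sketched with a plain ConcurrentHashMap standing in for the 'adminStubs' map. The class and method names below are illustrative assumptions, not the actual code touched by the PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch: String addresses stand in for HBase admin stubs.
// StubCache, getStub and evict are hypothetical names for illustration.
class StubCache {
  private final Map<String, String> stubs = new ConcurrentHashMap<>();

  // Mirrors getAdminStub(): the resolver runs only on the first lookup,
  // so a cached (possibly stale) address is returned on every later call.
  String getStub(String serverName, Supplier<String> resolver) {
    return stubs.computeIfAbsent(serverName, k -> resolver.get());
  }

  // On a ConnectionException the master would drop the stale entry, so the
  // next getStub() re-runs resolution and picks up the RS's new IP.
  void evict(String serverName) {
    stubs.remove(serverName);
  }

  public static void main(String[] args) {
    StubCache cache = new StubCache();
    System.out.println(cache.getStub("rs1,16020", () -> "10.0.0.1")); // resolved
    System.out.println(cache.getStub("rs1,16020", () -> "10.0.0.2")); // still cached
    cache.evict("rs1,16020");                                         // pod restarted
    System.out.println(cache.getStub("rs1,16020", () -> "10.0.0.2")); // re-resolved
  }
}
```

The key point is that without the evict() path the second resolver is never consulted, which is exactly why the master keeps retrying the dead pod's address.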



> [k8s] The crashed node exists in onlineServer forever, and if it holds the 
> meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> During a fresh master boot, we crash (kill -9) the RS that holds meta. We find
> that the master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> ...
> (the same FailedServerException warning repeats for try=2 through try=8, retrying...)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
