[ 
https://issues.apache.org/jira/browse/HBASE-28669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhongYou Li updated HBASE-28669:
--------------------------------
    Description: 
The peer "to_pd_A" has been removed, but the RegionServer still logs the 
following error:
{code:java}
2024-06-11 09:42:34.074 ERROR [ReplicationExecutor-0.replicationSource,to_pd_A-172.30.112.11,6002,1709612684705-SendThread(bjtx-hbase-onll-meta-01:2181)] client.StaticHostProvider: Unable to resolve address: bjtx-hbase-onll-meta-03:2181
java.net.UnknownHostException: bjtx-hbase-onll-meta-03
   at java.net.InetAddress$CachedAddresses.get(InetAddress.java:764)
   at java.net.InetAddress.getAllByName0(InetAddress.java:1291)
   at java.net.InetAddress.getAllByName(InetAddress.java:1144)
   at java.net.InetAddress.getAllByName(InetAddress.java:1065)
   at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:92)
   at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:147)
   at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:375)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1137)
{code}
Steps to reproduce:

I have a cluster with 3 RegionServers. The following steps reproduce the 
ZooKeeper connection leak:
1. Enable replication
2. Create a peer
3. Shut down any two RegionServers for a few minutes, then restart them
4. Take a thread dump on the RegionServer that was not shut down and search 
for the keyword <peerId>; there are 4 extra ZooKeeper client threads
5. After the peer is removed, the 4 extra threads still exist
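
The check in step 4 can be scripted. The following is a minimal sketch, not 
part of HBase: the class name ZkThreadFinder and the method findZkThreads are 
hypothetical, and it assumes the thread dump has been saved as plain text 
(for example from jstack) and that the leaked threads follow the naming seen 
in this report, i.e. the name contains "replicationSource,<peerId>-" and ends 
in "-EventThread" or contains "-SendThread(".

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class ZkThreadFinder {
    // In a thread dump, each thread entry starts with the double-quoted thread name.
    private static final Pattern THREAD_NAME = Pattern.compile("^\"([^\"]+)\"");

    /** Returns the names of ZooKeeper client threads that mention the given peer ID. */
    public static List<String> findZkThreads(String threadDump, String peerId) {
        List<String> matches = new ArrayList<>();
        for (String line : threadDump.split("\\R")) {
            Matcher m = THREAD_NAME.matcher(line);
            if (!m.find()) {
                continue; // not the first line of a thread entry
            }
            String name = m.group(1);
            // Assumption: ZooKeeper client threads carry these suffixes, as in the dump in this report.
            boolean isZkThread = name.contains("-SendThread(") || name.endsWith("-EventThread");
            // Assumption: replication source threads embed the peer ID in this form.
            boolean isForPeer = name.contains("replicationSource," + peerId + "-");
            if (isZkThread && isForPeer) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ZkThreadFinder <thread-dump-file> <peerId>
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<String> found = findZkThreads(dump, args[1]);
        System.out.println(found.size() + " ZooKeeper thread(s) for peer " + args[1]);
        found.forEach(System.out::println);
    }
}
```

If this still reports threads for a peer ID after that peer has been removed, 
that matches the leak described in step 5.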

 

The following are the leaked threads in the thread dump of one of my 
RegionServers:
{code:java}
"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-EventThread" #610 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=466.94s tid=0x00007efc58179000 nid=0x5a051 waiting on condition [0x00007efc2cdef000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-SendThread(10.0.16.100:2181)" #609 daemon prio=5 os_prio=0 cpu=3.02ms elapsed=466.94s tid=0x00007efc58178800 nid=0x5a050 runnable [0x00007efc2cef0000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-EventThread" #505 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=556.09s tid=0x00007efc50094800 nid=0x59c04 waiting on condition [0x00007efc2d7f7000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-SendThread(10.0.16.100:2181)" #504 daemon prio=5 os_prio=0 cpu=3.72ms elapsed=556.09s tid=0x00007efc50093000 nid=0x59c03 runnable [0x00007efc2d8f8000]
{code}

  was:
Original error log:
{code:java}
2024-06-11 09:42:34.074 ERROR [ReplicationExecutor-0.replicationSource,to_pd_A-172.30.112.11,6002,1709612684705-SendThread(bjtx-hbase-onll-meta-01:2181)] client.StaticHostProvider: Unable to resolve address: bjtx-hbase-onll-meta-03:2181
java.net.UnknownHostException: bjtx-hbase-onll-meta-03
   at java.net.InetAddress$CachedAddresses.get(InetAddress.java:764)
   at java.net.InetAddress.getAllByName0(InetAddress.java:1291)
   at java.net.InetAddress.getAllByName(InetAddress.java:1144)
   at java.net.InetAddress.getAllByName(InetAddress.java:1065)
   at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:92)
   at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:147)
   at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:375)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1137)
{code}
The peer "to_pd_A" has been removed.

Here are the steps to reproduce:

I have 3 RegionServers. The following steps can reproduce the phenomenon of ZK 
connection leakage:
1. Turn on the Replication function
2. Create a peer
3. Shut down any two RegionServers for a few minutes and restart them
4. Print the thread stack on the RegionServer that is not shut down, search for 
the keyword <peerId>, and you can see that there are 4 more threads with 
ZooKeeper
5. By removing the peer, the extra 4 threads still exist

 

The following is the thread stack leak in one of my RegionServers:
{code:java}
"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-EventThread" #610 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=466.94s tid=0x00007efc58179000 nid=0x5a051 waiting on condition [0x00007efc2cdef000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-SendThread(10.0.16.100:2181)" #609 daemon prio=5 os_prio=0 cpu=3.02ms elapsed=466.94s tid=0x00007efc58178800 nid=0x5a050 runnable [0x00007efc2cef0000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-EventThread" #505 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=556.09s tid=0x00007efc50094800 nid=0x59c04 waiting on condition [0x00007efc2d7f7000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-SendThread(10.0.16.100:2181)" #504 daemon prio=5 os_prio=0 cpu=3.72ms elapsed=556.09s tid=0x00007efc50093000 nid=0x59c03 runnable [0x00007efc2d8f8000]
{code}


> After one RegionServer restarts, another RegionServer leaks a connection to ZooKeeper
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-28669
>                 URL: https://issues.apache.org/jira/browse/HBASE-28669
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.4.5
>            Reporter: ZhongYou Li
>            Priority: Minor
>              Labels: Replication
>
> The peer "to_pd_A" has been removed, but the RegionServer still logs the 
> following error:
> {code:java}
> 2024-06-11 09:42:34.074 ERROR [ReplicationExecutor-0.replicationSource,to_pd_A-172.30.112.11,6002,1709612684705-SendThread(bjtx-hbase-onll-meta-01:2181)] client.StaticHostProvider: Unable to resolve address: bjtx-hbase-onll-meta-03:2181
> java.net.UnknownHostException: bjtx-hbase-onll-meta-03
>    at java.net.InetAddress$CachedAddresses.get(InetAddress.java:764)
>    at java.net.InetAddress.getAllByName0(InetAddress.java:1291)
>    at java.net.InetAddress.getAllByName(InetAddress.java:1144)
>    at java.net.InetAddress.getAllByName(InetAddress.java:1065)
>    at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:92)
>    at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:147)
>    at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:375)
>    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1137)
> {code}
> Steps to reproduce:
> I have a cluster with 3 RegionServers. The following steps reproduce the 
> ZooKeeper connection leak:
> 1. Enable replication
> 2. Create a peer
> 3. Shut down any two RegionServers for a few minutes, then restart them
> 4. Take a thread dump on the RegionServer that was not shut down and search 
> for the keyword <peerId>; there are 4 extra ZooKeeper client threads
> 5. After the peer is removed, the 4 extra threads still exist
>  
> The following are the leaked threads in the thread dump of one of my 
> RegionServers:
> {code:java}
> "ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-EventThread" #610 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=466.94s tid=0x00007efc58179000 nid=0x5a051 waiting on condition [0x00007efc2cdef000]
> "ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-SendThread(10.0.16.100:2181)" #609 daemon prio=5 os_prio=0 cpu=3.02ms elapsed=466.94s tid=0x00007efc58178800 nid=0x5a050 runnable [0x00007efc2cef0000]
> "ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-EventThread" #505 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=556.09s tid=0x00007efc50094800 nid=0x59c04 waiting on condition [0x00007efc2d7f7000]
> "ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-SendThread(10.0.16.100:2181)" #504 daemon prio=5 os_prio=0 cpu=3.72ms elapsed=556.09s tid=0x00007efc50093000 nid=0x59c03 runnable [0x00007efc2d8f8000]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
