[ https://issues.apache.org/jira/browse/HBASE-26963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell resolved HBASE-26963.
-----------------------------------------
    Fix Version/s: 2.5.0
                   3.0.0-alpha-3
                   2.4.13
     Hadoop Flags: Reviewed
       Resolution: Fixed

> ReplicationSource#removePeer hangs if we try to remove bad peer.
> ----------------------------------------------------------------
>
>                 Key: HBASE-26963
>                 URL: https://issues.apache.org/jira/browse/HBASE-26963
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Replication
>    Affects Versions: 2.5.0, 3.0.0-alpha-2, 2.4.11
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
>         Attachments: HBASE-26963.patch
>
>
> ReplicationSource#removePeer hangs if we try to remove a bad peer.
> Steps to reproduce:
> 1. Set the config replication.source.regionserver.abort to false so that
> the regionserver does not abort.
> 2. Add a dummy peer.
> 3. Remove that peer.
> The removePeer call hangs until the test times out.
> The attached patch reproduces this behavior.
> I can see the following threads in the stack trace:
> {noformat}
> "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1" #339 daemon prio=5 os_prio=31 tid=0x00007f8caa44a800 nid=0x22107 waiting on condition [0x00007000107e5000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown Source)
>         at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
> {noformat}
> {noformat}
> "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 os_prio=31 tid=0x00007f8ca82fa800 nid=0x22307 in Object.wait() [0x00007000106e2000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1260)
>         - locked <0x0000000799975ea0> (a java.lang.Thread)
>         at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330)
>         at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56)
>         at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61)
>         at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>         at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> {noformat}
> "Listener at localhost/55013" #20 daemon prio=5 os_prio=31 tid=0x00007f8caf95a000 nid=0x6703 waiting on condition [0x0000700002544000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442)
>         at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372)
>         at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
>         at org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861)
>         at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74)
>         at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66)
> {noformat}
> The main thread "TestBadReplicationPeer.testWrongReplicationEndpoint" is
> waiting for Admin#removeReplicationPeer.
> The refreshPeer thread (PeerProcedureHandlerImpl#removePeer) responsible for
> terminating the peer (#338) is waiting for the ReplicationSource thread to
> terminate.
> The ReplicationSource thread (#339) is sleeping.
> Notice that this thread's stack trace is in the
> ReplicationSource#uncaughtException method.
> When we call ReplicationSourceManager#removePeer, we set the sourceRunning
> flag to false and send an interrupt to the ReplicationSource thread
> [here|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L668-L674].
> In this case ReplicationSource was waiting to read the cluster id of the
> peer when it received an InterruptedException.
> {noformat}
> 2022-04-20 08:46:49,679 WARN [RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1] zookeeper.ZKUtil(228): connection to cluster: dummypeer_1-0x100229efa200009, quorum=127.0.0.1:55599, baseZNode=/1 Unable to set watcher on znode (/1/hbaseid)
> java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2016)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:212)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:221)
>         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>         at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:112)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:571)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> [ZKClusterId.readClusterIdZNode|https://github.com/apache/hbase/blob/branch-2.4/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKClusterId.java#L69-L72]
> catches the InterruptedException and returns null.
> ReplicationSource then sees that the sourceRunning flag is false and throws
> an IllegalStateException
> [here|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L561-L565].
> Control then passes to the
> [UncaughtExceptionHandler|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L620-L640]
> and, since abortOnError is false, the handler goes into an infinite sleep,
> causing the test to hang.
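
The chain of events in the report can be modelled with plain JDK threads. The sketch below is a simplified stand-in, not HBase code: the class name `RemovePeerHangSketch`, its methods, and the timings are all hypothetical, and only the `sourceRunning` and `abortOnError` names are borrowed from the report. It assumes the handler retries forever on the dying thread when `abortOnError` is false, which is how the report describes the hang.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified model of the reported hang; not the HBase implementation. */
public class RemovePeerHangSketch {
    // Stand-in for the sourceRunning flag mentioned in the report.
    private final AtomicBoolean sourceRunning = new AtomicBoolean(true);
    private final boolean abortOnError;
    private Thread initThread;

    public RemovePeerHangSketch(boolean abortOnError) {
        this.abortOnError = abortOnError;
    }

    // Models initialization: block as if reading the peer's cluster id,
    // swallow the interrupt, then fail because the source was stopped.
    private void initialize() {
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            // Swallowed, like a lookup that catches the interrupt and returns null.
        }
        if (!sourceRunning.get()) {
            throw new IllegalStateException("source is not running");
        }
    }

    public void startup() {
        initThread = new Thread(this::initialize, "replicationSource-sketch");
        initThread.setDaemon(true); // lets the JVM exit despite the stuck thread
        initThread.setUncaughtExceptionHandler((t, e) -> {
            if (abortOnError) {
                return; // a real server would abort; here the thread just ends
            }
            // The handler runs on the dying thread itself, so this loop
            // keeps that thread alive forever.
            while (true) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException ignored) {
                    // keep retrying
                }
            }
        });
        initThread.start();
    }

    // Models terminate(): clear the flag, interrupt, then wait for the thread.
    // The wait is bounded here so the demo finishes; returns true if it stopped.
    public boolean terminate(long joinMillis) {
        sourceRunning.set(false);
        initThread.interrupt();
        try {
            initThread.join(joinMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !initThread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        for (boolean abort : new boolean[] {true, false}) {
            RemovePeerHangSketch source = new RemovePeerHangSketch(abort);
            source.startup();
            Thread.sleep(100); // give the thread time to reach its blocking call
            System.out.println("abortOnError=" + abort
                + " stopped=" + source.terminate(500));
        }
    }
}
```

With `abortOnError=true` the thread exits and `terminate` reports that it stopped. With `abortOnError=false` the thread stays alive inside the handler, so a caller that joins without a bound would wait forever, which matches thread #338 in the dump above.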