[
https://issues.apache.org/jira/browse/HBASE-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HBASE-27193 started by Duo Zhang.
-----------------------------------------
> TestZooKeeper is flaky
> ----------------------
>
> Key: HBASE-27193
> URL: https://issues.apache.org/jira/browse/HBASE-27193
> Project: HBase
> Issue Type: Bug
> Components: test, Zookeeper
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
>
> https://nightlies.apache.org/hbase/HBase-Flaky-Tests/master/3847/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.TestZooKeeper-output.txt
> Found this in the log output
> {noformat}
> 2022-07-11T20:46:34,839 ERROR [zk-event-processor-pool-0]
> master.HMaster(3054): Master server abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2022-07-11T20:46:34,839 ERROR [zk-event-processor-pool-0]
> master.HMaster(3059): ***** ABORTING master
> jenkins-hbase19.apache.org,44207,1657572391676: Received an unexpected
> KeeperException, aborting *****
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /hbase/master
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
> ~[zookeeper-3.5.7.jar:3.5.7]
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> ~[zookeeper-3.5.7.jar:3.5.7]
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2021)
> ~[zookeeper-3.5.7.jar:3.5.7]
> at
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:262)
> ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:250)
> ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:108)
> ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.master.ActiveMasterManager.handleMasterNodeChange(ActiveMasterManager.java:197)
> ~[classes/:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.master.ActiveMasterManager.handle(ActiveMasterManager.java:130)
> ~[classes/:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.master.ActiveMasterManager.nodeDeleted(ActiveMasterManager.java:125)
> ~[classes/:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.zookeeper.ZKWatcher.lambda$processEvent$0(ZKWatcher.java:574)
> ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
> ~[hbase-common-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
> ~[hbase-common-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.zookeeper.ZKWatcher.processEvent(ZKWatcher.java:556)
> ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.zookeeper.ZKWatcher.lambda$process$1(ZKWatcher.java:609)
> ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> org.apache.hadoop.hbase.trace.TraceUtil.lambda$tracedRunnable$2(TraceUtil.java:155)
> ~[hbase-common-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[?:1.8.0_291]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ~[?:1.8.0_291]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ~[?:1.8.0_291]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> ~[?:1.8.0_291]
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_291]
> 2022-07-11T20:46:34,839 DEBUG [Listener at localhost.localdomain/39949]
> master.HMaster(3049): Abort called but aborted=true, stopped=false
> 2022-07-11T20:46:34,839 INFO [zk-event-processor-pool-0]
> master.HMaster(3153): ***** STOPPING master
> 'jenkins-hbase19.apache.org,44207,1657572391676' *****
> 2022-07-11T20:46:34,840 INFO [zk-event-processor-pool-0]
> master.HMaster(3155): STOPPED: Stopped by zk-event-processor-pool-0
> {noformat}
> So the problem here is that, we call abort manually in UT, but it is possible
> that after closing zookeeper client, the master itself will call abort in
> some background threads, and there is a race that, the abortedRequested is
> set, but stopped is not set while we call abort manually, so we will return
> immediately and once we check for master.isStopped, it is still false, as the
> log output above.
> It is just a test issue. Let me provide a very simple patch to fix it, just a
> wait is enough...
--
This message was sent by Atlassian Jira
(v8.20.10#820010)