[ https://issues.apache.org/jira/browse/CURATOR-561?focusedWorklogId=676398&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-676398 ]
ASF GitHub Bot logged work on CURATOR-561:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Nov/21 13:12
            Start Date: 04/Nov/21 13:12
    Worklog Time Spent: 10m
      Work Description: scott-kirk closed pull request #399:
                        URL: https://github.com/apache/curator/pull/399

Issue Time Tracking
-------------------

            Worklog Id: (was: 676398)
            Time Spent: 1h  (was: 50m)

> Race condition preventing reconnection
> ---------------------------------------
>
>                   Key: CURATOR-561
>                   URL: https://issues.apache.org/jira/browse/CURATOR-561
>               Project: Apache Curator
>            Issue Type: Bug
>      Affects Versions: 4.2.0
>              Reporter: Grant Digby
>              Assignee: Jordan Zimmerman
>              Priority: Major
>            Time Spent: 1h
>    Remaining Estimate: 0h
>
> Heavy GC caused several rounds of suspended, lost, reconnected, as you'd
> expect. But eventually it failed to reconnect, even though ZooKeeper was up
> the entire time; my thread waited in acquire until the retries were
> exhausted. I believe I've tracked this down to a race condition between the
> Curator thread, which calls ZooKeeperTestable#injectSessionExpiration to
> post the expiry and death events, and the ZooKeeper SendThread, which posts
> the same events.
>
> The EventThread is meant to receive the expiry event first, so it can set
> up its replacement, and then the death event, which makes it shut down. The
> race condition allows these two events to be received in the wrong order,
> so the EventThread shuts down before a replacement can be set up.
>
> The "Curator-ConnectionStateManager" thread calls injectSessionExpiration,
> which calls EventThread#queueEvent. It gets as far as my comment in the
> next code snippet: it sets the state and then, let's say, pauses there for
> whatever reason.
>
> {code:java}
> private void queueEvent(WatchedEvent event,
>                         Set<Watcher> materializedWatchers) {
>     if (event.getType() == EventType.None
>             && sessionState == event.getState()) {
>         return;
>     }
>     sessionState = event.getState(); // The Curator thread sets the state here.
>     // It then goes on to queue the expiry event, but for now it is paused
>     // at this point: the event has not yet been queued.
> {code}
>
> The SendThread then enters the same method with its Expired event, but the
> state has already been set to Expired by the Curator thread, so it returns
> without queueing the expiry event. It then queues the event of death:
>
> {code:java}
> // from ClientCnxn#onConnected
> eventThread.queueEvent(new WatchedEvent(
>         Watcher.Event.EventType.None,
>         Watcher.Event.KeeperState.Expired, null));
> eventThread.queueEventOfDeath();
> {code}
>
> The EventThread receives this event of death and shuts down. The Curator
> thread then continues and queues the original expiry event, but it's now
> too late: with no EventThread running, the client remains in the suspended
> state, and any operations just retry until they give up.
>
> Race conditions are difficult to explain, but hopefully that made sense; a
> self-contained simulation of the interleaving follows.
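>
> The following is a minimal simulation of that interleaving. To be clear,
> QueueEventRaceSketch, its State enum, and the pause hook are hypothetical
> stand-ins for the sessionState check in ClientCnxn's EventThread, not real
> ZooKeeper or Curator APIs; the latches just force the exact schedule
> described above.
>
> {code:java}
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.LinkedBlockingQueue;
>
> public class QueueEventRaceSketch {
>
>     enum State { CONNECTED, EXPIRED, DEATH }
>
>     static volatile State sessionState = State.CONNECTED;
>     static final BlockingQueue<State> waitingEvents = new LinkedBlockingQueue<>();
>     static final CountDownLatch stateSet = new CountDownLatch(1);
>     static final CountDownLatch sendThreadDone = new CountDownLatch(1);
>
>     // Mirrors the guard in EventThread#queueEvent: an event matching the
>     // recorded sessionState is silently dropped.
>     static void queueEvent(State event, Runnable pauseAfterStateSet) {
>         if (sessionState == event) {
>             return; // this drop is what swallows the real expiry event
>         }
>         sessionState = event;
>         if (pauseAfterStateSet != null) {
>             pauseAfterStateSet.run(); // the Curator thread stalls right here
>         }
>         waitingEvents.add(event);
>     }
>
>     public static void main(String[] args) throws Exception {
>         Thread curator = new Thread(() -> queueEvent(State.EXPIRED, () -> {
>             stateSet.countDown(); // state is EXPIRED, event not yet queued
>             try { sendThreadDone.await(); } catch (InterruptedException ignored) { }
>         }), "Curator-ConnectionStateManager");
>
>         Thread sendThread = new Thread(() -> {
>             try { stateSet.await(); } catch (InterruptedException ignored) { }
>             queueEvent(State.EXPIRED, null); // dropped by the guard above
>             waitingEvents.add(State.DEATH);  // queueEventOfDeath()
>             sendThreadDone.countDown();
>         }, "SendThread");
>
>         curator.start();
>         sendThread.start();
>         curator.join();
>         sendThread.join();
>
>         // Prints [DEATH, EXPIRED]: the death event overtakes the expiry
>         // event, so a real EventThread would shut down before the expiry
>         // is ever delivered.
>         System.out.println(waitingEvents);
>     }
> }
> {code}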
> I can recreate the issue in IntelliJ by using Suspend-All breakpoints to
> cause the session expiry and Suspend-Thread breakpoints to order the
> threads in the right way.
>
> Run the following and put a Suspend-All breakpoint on the first
> System.out.println:
>
> {code:java}
> package com.gebatech.curator;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
> import org.apache.curator.retry.BoundedExponentialBackoffRetry;
>
> public class GoCurator {
>
>     public static void main(String[] args) throws Exception {
>         CuratorFramework cf = CuratorFrameworkFactory.builder()
>                 .connectString("localhost:2181")
>                 .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000, 20))
>                 .sessionTimeoutMs(30 * 1000)
>                 .build();
>         cf.start();
>         InterProcessReadWriteLock lock = new InterProcessReadWriteLock(cf, "/grant/myLock");
>         System.out.println("Set a Suspend-All breakpoint here and leave it until the session expires, roughly 30 seconds");
>         lock.readLock().acquire();
>         System.out.println();
>     }
> }
> {code}
>
> Whilst suspended here, put a Suspend-Thread breakpoint in
> EventThread#queueEvent, on the if statement from the first code snippet.
> When you see the following log line, you can let the program resume:
>
> {code:java}
> 2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 0x1006300f2390000, timeout of 30000ms exceeded
> {code}
>
> The SendThread will hit your breakpoint first with a Disconnected event;
> you can let this run. Next you'll get the SendThread with the Expired
> event; wait here until you get the popup telling you that the Curator
> thread has hit the same breakpoint. Switch to the Curator thread and step
> through until it has set the expired state, then leave it held there and
> switch back to the SendThread. Let the SendThread run, then remove the
> breakpoints and let everything else run.
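>
> To make the outcome easier to observe, a connection state listener can be
> registered just after cf.start() in the program above. This uses Curator's
> public ConnectionStateListener API; the log format is only illustrative:
>
> {code:java}
> // Hypothetical addition to GoCurator#main, after cf.start(): logs every
> // SUSPENDED/LOST/RECONNECTED transition, so you can see that RECONNECTED
> // never arrives once the race has fired.
> cf.getConnectionStateListenable().addListener(
>         (client, newState) ->
>                 System.out.println(System.currentTimeMillis() + " connection state -> " + newState));
> {code}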
>
> The program will then be stuck, and you'll see this stack trace for the
> thread blocked in acquire:
>
> {code:java}
> "main@1" prio=5 tid=0x1 nid=NA waiting
>   java.lang.Thread.State: WAITING
>     at java.lang.Object.wait(Object.java:-1)
>     at java.lang.Object.wait(Object.java:502)
>     at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)
>     at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547)
>     at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
>     at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
>     at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
>     at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
>     at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
>     at com.gebatech.curator.GoCurator.main(GoCurator.java:26)
> {code}
>
> I know this issue falls between ZooKeeper and Curator, but I thought I'd
> report it in case you can think of a solution. Ideally, I think, they'd
> send a single expiry event with a death flag set or, performance impact
> allowing, synchronize that method.
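>
> Applying the second option to the simulation sketch earlier in this
> description shows why it would work. This is only an illustration of the
> idea against that hypothetical sketch (with the pause hook removed), not a
> patch against ClientCnxn:
>
> {code:java}
> // Making the duplicate check, the state write, and the enqueue one atomic
> // step: the SendThread's duplicate Expired event is then dropped only
> // after the Curator thread's copy is already on the queue, so the death
> // event can no longer overtake it and the queue ends up [EXPIRED, DEATH].
> static synchronized void queueEvent(State event) {
>     if (sessionState == event) {
>         return;
>     }
>     sessionState = event;
>     waitingEvents.add(event); // atomic with the check above
> }
> {code}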