[
https://issues.apache.org/jira/browse/CURATOR-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330919#comment-17330919
]
Francesco Nigro commented on CURATOR-595:
-----------------------------------------
I've further progressed on this:
{code:java}
INFO org.apache.curator.framework.state.ConnectionStateManager State change:
SUSPENDED [main-EventThread]
WARN org.apache.curator.framework.state.ConnectionStateManager Session
timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed
ms: 4001. Adjusted session timeout ms: 4000 [Curator-ConnectionStateManager-0]
WARN org.apache.curator.ConnectionState Session expired event received
[main-EventThread]
DEBUG org.apache.curator.ConnectionState reset [main-EventThread]
Breakpoint reached
at
org.apache.zookeeper.ZooKeeperTestable.injectSessionExpiration(ZooKeeperTestable.java:40)
at
org.apache.curator.framework.state.ConnectionStateManager.checkSessionExpiration(ConnectionStateManager.java:321)
at
org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:285)
at
org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
at
org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:132)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
at java.util.concurrent.FutureTask.run(FutureTask.java:-1)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.lang.Thread.run(Thread.java:832)
inject closed on sessionid:0x1000eb508090000 local:/127.0.0.1:58770
remoteserver:localhost/127.0.0.1:36721 lastZxid:9 xid:18 sent:18 recv:18
queuedpkts:1 pendingresp:0 queuedevents:1 with
Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
Breakpoint reached
at
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1124)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1182)
setting state from CLOSED to CONNECTING on
Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
INFO org.apache.curator.framework.state.ConnectionStateManager State change:
LOST [main-EventThread]
{code}
The log shows that a session expired call from curator is injecting a session
expiration with ZooKeeperTestable::injectSessionExpiration, that's going to set
ClientCnxn::state == CLOSED, but, given that there was already a running client
with its own SenderThread, trying to reconnect, SendThread::startConnect is
going to set ClientCnxn::state == CONNECTING overwriting the CLOSED state and
making the SenderThread to leak.
I still don't know what curator can do in order to save this to happen, I'm
going into the rabbit hole of the zookeeper client + curator now.
> InterProcessSemaphoreV2 LOST isn't releasing permits for other clients
> ----------------------------------------------------------------------
>
> Key: CURATOR-595
> URL: https://issues.apache.org/jira/browse/CURATOR-595
> Project: Apache Curator
> Issue Type: Bug
> Components: Recipes
> Affects Versions: 5.1.0
> Reporter: Francesco Nigro
> Assignee: Jordan Zimmerman
> Priority: Major
>
> I'm not sure this is the right place to raise this, but I've added this test
> on TestInterProcessSemaphore:
> {code:java}
> @Test
> public void testAcquireAfterLostServerOnRestart() throws Exception {
> final int sessionTimout = 4000;
> final int connectionTimout = 2000;
> try (CuratorFramework client =
> CuratorFrameworkFactory.newClient(server.getConnectString(), sessionTimout,
> connectionTimout, new RetryNTimes(0, 1))) {
> client.start();
> client.blockUntilConnected();
> final InterProcessSemaphoreV2 semaphore = new
> InterProcessSemaphoreV2(client, "/1", 1);
> assertNotNull(semaphore.acquire());
> CountDownLatch lost = new CountDownLatch(1);
> client.getConnectionStateListenable().addListener((client1,
> newState) -> {
> if (newState == ConnectionState.LOST) {
> lost.countDown();
> }
> });
> server.stop();
> lost.await();
> }
> server.restart();
> try (CuratorFramework client =
> CuratorFrameworkFactory.newClient(server.getConnectString(), sessionTimout,
> connectionTimout, new RetryNTimes(0, 1))) {
> client.start();
> client.blockUntilConnected();
> final InterProcessSemaphoreV2 semaphore = new
> InterProcessSemaphoreV2(client, "/1", 1);
> final int serverTick = ZooKeeperServer.DEFAULT_TICK_TIME;
> Thread.sleep(sessionTimout + serverTick);
> assertNotNull(semaphore.acquire(0, TimeUnit.SECONDS));
> }
> }
> {code}
> And this is not passing: the doc of InterProcessSemaphoreV2 state that
> bq. "However, if the client session drops (crash, etc.), any leases held by
> the client are automatically closed and made available to other clients."
> maybe I'm missing something obvious on the ZK server config instead.
> Just checked out that by running on separated processes the same test:
> # start server on process A
> # start lease acquire on process B, listening for LOST events before suicide
> # restart server on Process A cause process B to suicide (as expected)
> # start lease acquire on process C, now succeed
> It seems that there is something going on in the intra-process case that's
> not working as expected (to me, at least).
> NOTE: as written in newer comments, raising the timeout doesn't seems to work
> too and different boxes are getting different outcomes (making this an
> intermittent failure).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)