[jira] [Comment Edited] (CURATOR-595) InterProcessSemaphoreV2 LOST isn't releasing permits for other clients

Francesco Nigro (Jira) Fri, 23 Apr 2021 10:12:22 -0700


    [ 
https://issues.apache.org/jira/browse/CURATOR-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330919#comment-17330919
 ]


Francesco Nigro edited comment on CURATOR-595 at 4/23/21, 5:11 PM:
-------------------------------------------------------------------

I've further progressed on this:

{code:java}
INFO  org.apache.curator.framework.state.ConnectionStateManager  State change: 
SUSPENDED [main-EventThread]
WARN  org.apache.curator.framework.state.ConnectionStateManager  Session 
timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed 
ms: 4001. Adjusted session timeout ms: 4000 [Curator-ConnectionStateManager-0]
WARN  org.apache.curator.ConnectionState  Session expired event received 
[main-EventThread]
DEBUG org.apache.curator.ConnectionState  reset [main-EventThread]
Breakpoint reached
        at 
org.apache.zookeeper.ZooKeeperTestable.injectSessionExpiration(ZooKeeperTestable.java:40)
        at 
org.apache.curator.framework.state.ConnectionStateManager.checkSessionExpiration(ConnectionStateManager.java:321)
        at 
org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:285)
        at 
org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
        at 
org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:132)
        at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
        at java.util.concurrent.FutureTask.run(FutureTask.java:-1)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.lang.Thread.run(Thread.java:832)
inject closed on sessionid:0x1000eb508090000 local:/127.0.0.1:58770 
remoteserver:localhost/127.0.0.1:36721 lastZxid:9 xid:18 sent:18 recv:18 
queuedpkts:1 pendingresp:0 queuedevents:1 with 
Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
Breakpoint reached
        at 
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1124)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1182)
setting state from CLOSED to CONNECTING on 
Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
INFO  org.apache.curator.framework.state.ConnectionStateManager  State change: 
LOST [main-EventThread]
{code}

The log shows that a session expired call from curator is injecting a session 
expiration with ZooKeeperTestable::injectSessionExpiration, that's going to set 
ClientCnxn::state == CLOSED, but, given that there was already a running client 
with its own SenderThread duty cycle loop (calling select with a session 
timeout), trying to reconnect,  SendThread::startConnect set 
 ClientCnxn::state == CONNECTING overwriting the CLOSED state and making the 
SenderThread to leak.

I still don't know what curator can do in order to save this to happen, I'm 
going into the rabbit hole of the zookeeper client + curator now.

I just see that org.apache.zookeeper.Testable used by curator to inject the 
expiration simulation is probably racy (as expected) and meant to be used only 
for testing purpose, as the ZooKeeper doc states: 

{code:java}
    // VisibleForTesting
    public Testable getTestable() {
        return new ZooKeeperTestable(cnxn);
    }
{code}

Probably using this method needs some extra check to make it works reliably, 
wdyt [~randgalt]?


was (Author: nigrofranz):
I've further progressed on this:

{code:java}
INFO  org.apache.curator.framework.state.ConnectionStateManager  State change: 
SUSPENDED [main-EventThread]
WARN  org.apache.curator.framework.state.ConnectionStateManager  Session 
timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed 
ms: 4001. Adjusted session timeout ms: 4000 [Curator-ConnectionStateManager-0]
WARN  org.apache.curator.ConnectionState  Session expired event received 
[main-EventThread]
DEBUG org.apache.curator.ConnectionState  reset [main-EventThread]
Breakpoint reached
        at 
org.apache.zookeeper.ZooKeeperTestable.injectSessionExpiration(ZooKeeperTestable.java:40)
        at 
org.apache.curator.framework.state.ConnectionStateManager.checkSessionExpiration(ConnectionStateManager.java:321)
        at 
org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:285)
        at 
org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
        at 
org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:132)
        at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
        at java.util.concurrent.FutureTask.run(FutureTask.java:-1)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.lang.Thread.run(Thread.java:832)
inject closed on sessionid:0x1000eb508090000 local:/127.0.0.1:58770 
remoteserver:localhost/127.0.0.1:36721 lastZxid:9 xid:18 sent:18 recv:18 
queuedpkts:1 pendingresp:0 queuedevents:1 with 
Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
Breakpoint reached
        at 
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1124)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1182)
setting state from CLOSED to CONNECTING on 
Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
INFO  org.apache.curator.framework.state.ConnectionStateManager  State change: 
LOST [main-EventThread]
{code}

The log shows that a session expired call from curator is injecting a session 
expiration with ZooKeeperTestable::injectSessionExpiration, that's going to set 
ClientCnxn::state == CLOSED, but, given that there was already a running client 
with its own SenderThread duty cycle loop (calling select with a session 
timeout), trying to reconnect,  SendThread::startConnect set 
 ClientCnxn::state == CONNECTING overwriting the CLOSED state and making the 
SenderThread to leak.

I still don't know what curator can do in order to save this to happen, I'm 
going into the rabbit hole of the zookeeper client + curator now.


> InterProcessSemaphoreV2 LOST isn't releasing permits for other clients
> ----------------------------------------------------------------------
>
>                 Key: CURATOR-595
>                 URL: https://issues.apache.org/jira/browse/CURATOR-595
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 5.1.0
>            Reporter: Francesco Nigro
>            Assignee: Jordan Zimmerman
>            Priority: Major
>
> I'm not sure this is the right place to raise this, but I've added this test 
> on TestInterProcessSemaphore:
> {code:java}
>     @Test
>     public void testAcquireAfterLostServerOnRestart() throws Exception {
>         final int sessionTimout = 4000;
>         final int connectionTimout = 2000;
>         try (CuratorFramework client = 
> CuratorFrameworkFactory.newClient(server.getConnectString(), sessionTimout, 
> connectionTimout, new RetryNTimes(0, 1))) {
>             client.start();
>             client.blockUntilConnected();
>             final InterProcessSemaphoreV2 semaphore = new 
> InterProcessSemaphoreV2(client, "/1", 1);
>             assertNotNull(semaphore.acquire());
>             CountDownLatch lost = new CountDownLatch(1);
>             client.getConnectionStateListenable().addListener((client1, 
> newState) -> {
>                 if (newState == ConnectionState.LOST) {
>                     lost.countDown();
>                 }
>             });
>             server.stop();
>             lost.await();
>         }
>         server.restart();
>         try (CuratorFramework client = 
> CuratorFrameworkFactory.newClient(server.getConnectString(), sessionTimout, 
> connectionTimout, new RetryNTimes(0, 1))) {
>             client.start();
>             client.blockUntilConnected();
>             final InterProcessSemaphoreV2 semaphore = new 
> InterProcessSemaphoreV2(client, "/1", 1);
>             final int serverTick = ZooKeeperServer.DEFAULT_TICK_TIME;
>             Thread.sleep(sessionTimout + serverTick);
>             assertNotNull(semaphore.acquire(0, TimeUnit.SECONDS));
>         }
>     }
> {code}
> And this is not passing: the doc of InterProcessSemaphoreV2 state that 
> bq. "However, if the client session drops (crash, etc.), any leases held by 
> the client are automatically closed and made available to other clients." 
> maybe I'm missing something obvious on the ZK server config instead.
> Just checked out that by running on separated processes the same test:
> # start server on process A
> # start lease acquire on process B, listening for LOST events before suicide
> # restart server on Process A cause process B to suicide (as expected)
> # start lease acquire on process C, now succeed
> It seems that there is something going on in the intra-process case that's 
> not working as expected (to me, at least).
> NOTE: as written in newer comments, raising the timeout doesn't seems to work 
> too and different boxes are getting different outcomes (making this an 
> intermittent failure).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (CURATOR-595) InterProcessSemaphoreV2 LOST isn't releasing permits for other clients

Reply via email to