[ https://issues.apache.org/jira/browse/CURATOR-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330919#comment-17330919 ]

Francesco Nigro edited comment on CURATOR-595 at 4/23/21, 4:59 PM:
-------------------------------------------------------------------

I've further progressed on this:

{code:java}
INFO  org.apache.curator.framework.state.ConnectionStateManager  State change: SUSPENDED [main-EventThread]
WARN  org.apache.curator.framework.state.ConnectionStateManager  Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 4001. Adjusted session timeout ms: 4000 [Curator-ConnectionStateManager-0]
WARN  org.apache.curator.ConnectionState  Session expired event received [main-EventThread]
DEBUG org.apache.curator.ConnectionState  reset [main-EventThread]
Breakpoint reached
        at org.apache.zookeeper.ZooKeeperTestable.injectSessionExpiration(ZooKeeperTestable.java:40)
        at org.apache.curator.framework.state.ConnectionStateManager.checkSessionExpiration(ConnectionStateManager.java:321)
        at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:285)
        at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
        at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:132)
        at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
        at java.util.concurrent.FutureTask.run(FutureTask.java:-1)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.lang.Thread.run(Thread.java:832)
inject closed on sessionid:0x1000eb508090000 local:/127.0.0.1:58770 remoteserver:localhost/127.0.0.1:36721 lastZxid:9 xid:18 sent:18 recv:18 queuedpkts:1 pendingresp:0 queuedevents:1 with Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
Breakpoint reached
        at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1124)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1182)
setting state from CLOSED to CONNECTING on Thread[main-SendThread(127.0.0.1:36721),5,main] 1780531211
INFO  org.apache.curator.framework.state.ConnectionStateManager  State change: LOST [main-EventThread]
{code}
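The WARN line above ("Session timeout has elapsed while SUSPENDED. Injecting a session expiration.") reflects an elapsed-time check. A minimal sketch of the assumed logic, with numbers matching the log (this is my reading, not the actual ConnectionStateManager source):

```java
// Assumed shape of the check behind the WARN above (not Curator source):
// while the connection is SUSPENDED, once the elapsed time reaches the
// (adjusted) session timeout, a session expiration is injected locally.
public class SessionWatchdogSketch {
    static boolean shouldInjectExpiration(long startOfSuspendedMs,
                                          long nowMs,
                                          int adjustedSessionTimeoutMs) {
        long elapsedMs = nowMs - startOfSuspendedMs;
        return elapsedMs >= adjustedSessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Matches the logged numbers: elapsed 4001 ms vs adjusted timeout 4000 ms.
        System.out.println(shouldInjectExpiration(0, 4001, 4000)); // true
        System.out.println(shouldInjectExpiration(0, 3999, 4000)); // false
    }
}
```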

The log shows that Curator's session-expiration check is injecting a session expiration via ZooKeeperTestable::injectSessionExpiration, which sets ClientCnxn::state == CLOSED. However, because there was already a running client whose SendThread duty-cycle loop (calling select with the session timeout) was trying to reconnect, SendThread::startConnect then sets ClientCnxn::state == CONNECTING, overwriting the CLOSED state and causing the SendThread to leak.
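The overwrite can be sketched as a lost-update race on the state variable. This is a deterministic simulation of the suspected interleaving, not ZooKeeper code; the guarded variant only shows one way such a transition could be protected:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch (not ZooKeeper code) of the suspected lost update:
// injectSessionExpiration sets state to CLOSED, but startConnect blindly
// overwrites it with CONNECTING, so the CLOSED marker is lost.
public class StateRaceSketch {
    enum State { CONNECTED, CLOSED, CONNECTING }

    public static void main(String[] args) {
        AtomicReference<State> state = new AtomicReference<>(State.CONNECTED);

        // Blind write: the CLOSED marker is silently lost.
        state.set(State.CLOSED);        // injectSessionExpiration
        state.set(State.CONNECTING);    // startConnect, unaware of CLOSED
        System.out.println("blind set: " + state.get()); // SendThread keeps running

        // Guarded transition: only move to CONNECTING if not already CLOSED.
        state.set(State.CONNECTED);
        state.set(State.CLOSED);        // injectSessionExpiration again
        boolean moved = state.compareAndSet(State.CONNECTED, State.CONNECTING);
        System.out.println("guarded set: " + state.get() + ", moved=" + moved);
    }
}
```

With the blind write the thread ends up CONNECTING and never observes the shutdown; with the compare-and-set the CLOSED state survives and the reconnect attempt is refused.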

I still don't know what Curator can do to prevent this from happening; I'm going down the rabbit hole of the ZooKeeper client + Curator code now.




> InterProcessSemaphoreV2 LOST isn't releasing permits for other clients
> ----------------------------------------------------------------------
>
>                 Key: CURATOR-595
>                 URL: https://issues.apache.org/jira/browse/CURATOR-595
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 5.1.0
>            Reporter: Francesco Nigro
>            Assignee: Jordan Zimmerman
>            Priority: Major
>
> I'm not sure this is the right place to raise this, but I've added this test to TestInterProcessSemaphore:
> {code:java}
>     @Test
>     public void testAcquireAfterLostServerOnRestart() throws Exception {
>         final int sessionTimeout = 4000;
>         final int connectionTimeout = 2000;
>         try (CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), sessionTimeout, connectionTimeout, new RetryNTimes(0, 1))) {
>             client.start();
>             client.blockUntilConnected();
>             final InterProcessSemaphoreV2 semaphore = new InterProcessSemaphoreV2(client, "/1", 1);
>             assertNotNull(semaphore.acquire());
>             CountDownLatch lost = new CountDownLatch(1);
>             client.getConnectionStateListenable().addListener((client1, newState) -> {
>                 if (newState == ConnectionState.LOST) {
>                     lost.countDown();
>                 }
>             });
>             server.stop();
>             lost.await();
>         }
>         server.restart();
>         try (CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), sessionTimeout, connectionTimeout, new RetryNTimes(0, 1))) {
>             client.start();
>             client.blockUntilConnected();
>             final InterProcessSemaphoreV2 semaphore = new InterProcessSemaphoreV2(client, "/1", 1);
>             final int serverTick = ZooKeeperServer.DEFAULT_TICK_TIME;
>             Thread.sleep(sessionTimeout + serverTick);
>             assertNotNull(semaphore.acquire(0, TimeUnit.SECONDS));
>         }
>     }
> {code}
> And this is not passing. The doc of InterProcessSemaphoreV2 states:
> bq. "However, if the client session drops (crash, etc.), any leases held by the client are automatically closed and made available to other clients."
> Maybe I'm missing something obvious in the ZK server config instead.
> I just verified by running the same test across separate processes:
> # start the server in process A
> # start a lease acquire in process B, listening for LOST events before killing itself
> # restarting the server in process A causes process B to kill itself (as expected)
> # start a lease acquire in process C: it now succeeds
> It seems that there is something going on in the intra-process case that's not working as expected (to me, at least).
> NOTE: as written in newer comments, raising the timeout doesn't seem to work either, and different boxes get different outcomes (making this an intermittent failure).
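The quoted contract (leases released when the owning session drops) can be illustrated with a pure in-memory simulation. This is a hypothetical sketch of the intended semantics only, with made-up names and no real ZooKeeper or Curator involved:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation (not Curator code) of the documented lease
// contract: each lease is tied to a session, so when the session expires
// every lease it owns must become available to other clients again.
public class EphemeralLeaseSketch {
    private final Map<Long, Set<String>> leasesBySession = new HashMap<>();
    private final Set<String> heldLeases = new LinkedHashSet<>();
    private final int maxLeases;

    public EphemeralLeaseSketch(int maxLeases) { this.maxLeases = maxLeases; }

    /** Try to acquire a lease for a session; returns the node name or null. */
    public String acquire(long sessionId, String node) {
        if (heldLeases.size() >= maxLeases) return null;
        heldLeases.add(node);
        leasesBySession.computeIfAbsent(sessionId, k -> new HashSet<>()).add(node);
        return node;
    }

    /** Session expiry releases all leases owned by that session. */
    public void expireSession(long sessionId) {
        Set<String> owned = leasesBySession.remove(sessionId);
        if (owned != null) heldLeases.removeAll(owned);
    }

    public static void main(String[] args) {
        EphemeralLeaseSketch sem = new EphemeralLeaseSketch(1);
        System.out.println(sem.acquire(0x1L, "lease-0000")); // lease-0000
        System.out.println(sem.acquire(0x2L, "lease-0001")); // null: no permit left
        sem.expireSession(0x1L);                             // session drops
        System.out.println(sem.acquire(0x2L, "lease-0001")); // lease-0001
    }
}
```

The reported bug is precisely that, in the intra-process case, the second client never observes the equivalent of the `expireSession` step.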



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
