[ https://issues.apache.org/jira/browse/ZOOKEEPER-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872024#comment-17872024 ]
Benoit Sigoure edited comment on ZOOKEEPER-4541 at 8/8/24 2:51 PM:
-------------------------------------------------------------------

Just found a quorum on a dev cluster we have that also exhibits this problem: the leader & 1 follower don't have an ephemeral znode for an app that went away, but the other follower still has it. The follower that exhibits the problem restarted 22h ago, whereas the leader and the follower that are fine haven't restarted in several days. The lingering ephemeral znode should have been deleted 12h ago, when the app that created it went away.

Edit: upon looking at the logs more closely, the signature looks exactly like ZOOKEEPER-4394.
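A quick way to confirm which members still serve the znode is to query each server directly, since read requests are answered from the connected server's local data tree. Below is a minimal sketch with the plain ZooKeeper Java client (not from the report; the zk1/zk2/zk3 host names are placeholders, and the path is the znode quoted in the issue description below):

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CheckLingeringEphemeral {
    // Placeholder host names; replace with the actual ensemble members.
    static final String[] SERVERS = {"zk1:2181", "zk2:2181", "zk3:2181"};
    // The znode path from the report below.
    static final String PATH = "/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2"
            + "/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982";

    public static void main(String[] args) throws Exception {
        for (String server : SERVERS) {
            // Connect to one member at a time so the read is served from that
            // server's own data tree instead of whichever member a multi-host
            // connect string happens to pick.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(server, 30000, (WatchedEvent event) -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            try {
                connected.await(10, TimeUnit.SECONDS);
                Stat stat = zk.exists(PATH, false);
                if (stat == null) {
                    System.out.println(server + ": znode absent");
                } else {
                    System.out.println(server + ": znode present, ephemeralOwner=0x"
                            + Long.toHexString(stat.getEphemeralOwner()));
                }
            } finally {
                zk.close();
            }
        }
    }
}
{code}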
> Ephemeral znode owned by closed session visible in 1 of 3 servers
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4541
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4541
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>            Reporter: Håkon Hallingstad
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 9h
>  Remaining Estimate: 0h
>
> We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the following znode on server 1:
> {code:java}
> stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> cZxid = 0xa240000381f
> ctime = Tue May 10 17:17:27 UTC 2022
> mZxid = 0xa240000381f
> mtime = Tue May 10 17:17:27 UTC 2022
> pZxid = 0xa240000381f
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x20006d5d6a10000
> dataLength = 14
> numChildren = 0
> {code}
> This znode was absent on servers 2 and 3. A delete of the znode failed everywhere:
> {code:java}
> delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}
> This makes sense, as a mutating operation goes through the leader, which was server 3, where the znode is absent. Restarting server 1 did not fix the issue. Stopping server 1, removing the ZooKeeper database (the version-2 directory), and starting the server again did fix it.
> Session 0x20006d5d6a10000 was created by a ZooKeeper client that initially connected to server 2. The client later closed the session at around the same time as the then-leader, server 2, was stopped:
> {code:java}
> 2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
> 2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
> {code}
> That both were closed/shut down at the same time is something we do on our side. Perhaps there is a race between the handling of session closes and the shutdown of the leader?
> We did a {{dump}} of servers 1-3. Server 1 had two sessions that did not exist on servers 2 and 3, and it reported one ephemeral znode that servers 2 and 3 did not report. The ephemeral owner matched one of the two extraneous sessions. The dump from server 1 with the extra entries:
> {code:java}
> Global Sessions(8):
> ...
> 0x20006d5d6a10000 120000ms
> 0x2004360ecca0000 120000ms
> ...
> Sessions with Ephemerals (4):
> 0x20006d5d6a10000:
> 	/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}
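To repeat that per-server {{dump}} comparison, each member can be asked for its own list of global sessions and ephemerals and the three outputs diffed. A rough sketch using the {{dump}} four-letter command over the client port (host names are placeholders; this assumes {{dump}} is permitted by 4lw.commands.whitelist on the servers):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class DumpAllServers {
    // Placeholder host names; replace with the actual ensemble members.
    static final String[] SERVERS = {"zk1", "zk2", "zk3"};
    static final int CLIENT_PORT = 2181;

    public static void main(String[] args) throws Exception {
        for (String server : SERVERS) {
            System.out.println("==== " + server + " ====");
            try (Socket socket = new Socket(server, CLIENT_PORT)) {
                // Send the "dump" four-letter command; the server answers with its
                // global sessions and its session -> ephemeral-znode map, then
                // closes the connection.
                OutputStream out = socket.getOutputStream();
                out.write("dump".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        socket.getInputStream(), StandardCharsets.UTF_8));
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
{code}

Redirecting each server's output to a file and diffing the files makes the extra sessions and the lingering ephemeral stand out immediately.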