[ https://issues.apache.org/jira/browse/ZOOKEEPER-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536036#comment-17536036 ]

Håkon Hallingstad commented on ZOOKEEPER-4541:
----------------------------------------------

One might suspect hardware issues, but as far as I can see there is nothing suspicious in dmesg or journalctl. We have run ZK 3.7.0 for about 6 months, and there have been no recent changes to how we run ZK.

Server 1 did not have a frozen view of the znode tree: the set of children of {{/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2}} changed over time, and the set was identical on all servers, except that server 1 also saw the extraneous {{_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982}} znode.
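
A minimal sketch of such a cross-server comparison, using the plain Java client: it asks each server individually for the children of the lock znode and prints any child that is not visible on every server. The cfg1/cfg2/cfg3 host names and client port 2181 are assumptions; the lock path is the one from the issue.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Ask each ensemble member individually for the children of the lock znode
// and report any child that is not visible on every server.
// NOTE: the cfg1/cfg2/cfg3 host names and port 2181 are assumptions.
public class CompareLockChildren {
    private static final String LOCK_PATH =
            "/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2";

    public static void main(String[] args) throws Exception {
        Map<String, TreeSet<String>> childrenByServer = new HashMap<>();
        for (String server : new String[] {"cfg1:2181", "cfg2:2181", "cfg3:2181"}) {
            // Connect to exactly one server, so the read below is answered by that server.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(server, 30_000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            List<String> children = zk.getChildren(LOCK_PATH, false);
            childrenByServer.put(server, new TreeSet<>(children));
            zk.close();
        }

        // Any child missing from at least one server points at a diverging replica.
        TreeSet<String> union = new TreeSet<>();
        childrenByServer.values().forEach(union::addAll);
        for (String child : union) {
            childrenByServer.forEach((server, children) -> {
                if (!children.contains(child)) {
                    System.out.println(child + " is missing on " + server);
                }
            });
        }
    }
}
{code}

Connecting to one server at a time is the point here: a read is answered by the server the client happens to be connected to, so with a connect string listing all three servers it would be unclear which server produced a given view.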

> Ephemeral znode owned by closed session visible in 1 of 3 servers
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4541
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4541
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>            Reporter: Håkon Hallingstad
>            Priority: Major
>
> We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the following znode on server 1:
> {code:java}
> stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> cZxid = 0xa240000381f
> ctime = Tue May 10 17:17:27 UTC 2022
> mZxid = 0xa240000381f
> mtime = Tue May 10 17:17:27 UTC 2022
> pZxid = 0xa240000381f
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x20006d5d6a10000
> dataLength = 14
> numChildren = 0
> {code}
> This znode was absent on servers 2 and 3. A delete of the node failed everywhere:
> {code:java}
> delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}
> This makes sense, as a mutating operation goes to the leader, which was server 3, where the node is absent. Restarting server 1 did not fix the issue. Stopping server 1, removing the ZooKeeper database and version-2 directory, and starting the server fixed the issue.
>
> Session 0x20006d5d6a10000 was created by a ZooKeeper client and initially connected to server 2. The client later closed the session at around the same time as the then-leader, server 2, was stopped:
> {code:java}
> 2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
> 2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
> {code}
> That the session is closed and the server is shut down at around the same time is something we do on our side. Perhaps there is a race between the handling of session closes and the shutdown of the leader?
>
> We did a {{dump}} of servers 1-3: server 1 had two sessions that did not exist on servers 2 and 3, and one ephemeral node was reported on server 1 that was not reported on servers 2 and 3. The ephemeral owner matched one of the two extraneous sessions. The dump from server 1, with the extra entries:
> {code:java}
> Global Sessions(8):
> ...
> 0x20006d5d6a10000 120000ms
> 0x2004360ecca0000 120000ms
> ...
> Sessions with Ephemerals (4):
> 0x20006d5d6a10000:
> 	/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)