[ https://issues.apache.org/jira/browse/ZOOKEEPER-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178869#comment-17178869 ]

Lea Morschel edited comment on ZOOKEEPER-3890 at 8/17/20, 9:34 AM:
-------------------------------------------------------------------

Sorry for taking so long to answer, and thank you for investigating!

We observed this issue with an embedded ZooKeeper. After further investigation 
I discovered that our code includes workarounds for issues 
https://issues.apache.org/jira/browse/ZOOKEEPER-2812 and 
https://issues.apache.org/jira/browse/ZOOKEEPER-2810, which result in the 
SessionTracker sometimes being passed an empty HashMap instead of the correct 
listing of sessionsWithTimeouts on startup. That was a mistake on our side and 
is now fixed. Our recent transition from ZooKeeper 3.4 to 3.5 might have played 
a role as well.
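For illustration only, here is a minimal sketch of an embedded startup that 
lets the server restore its own session table from disk. The directories, tick 
time, client port and connection limit below are placeholders, not our actual 
embedding code:
{code:java}
import java.io.File;
import java.net.InetSocketAddress;

import org.apache.zookeeper.server.NIOServerCnxnFactory;
import org.apache.zookeeper.server.ZooKeeperServer;
import org.apache.zookeeper.server.persistence.FileTxnSnapLog;

public class EmbeddedZk {
    public static void main(String[] args) throws Exception {
        // Placeholder directories; both may point to the same path.
        File dataLogDir = new File("/my/path");
        File snapDir = new File("/my/path");

        // Transaction log / snapshot pair the server restores its state from.
        FileTxnSnapLog txnLog = new FileTxnSnapLog(dataLogDir, snapDir);

        // Let the server build and load its own ZKDatabase instead of
        // handing it a freshly constructed (empty) session map.
        ZooKeeperServer server = new ZooKeeperServer(txnLog, 2000);

        NIOServerCnxnFactory factory = new NIOServerCnxnFactory();
        factory.configure(new InetSocketAddress(2181), 60);
        // startup() loads the database before the session tracker is
        // created, so restored sessions keep their timeouts and can expire.
        factory.startup(server);
    }
}
{code}
The relevant point is that the database is loaded before the session tracker 
is created, so the tracker starts from the persisted sessionsWithTimeouts 
rather than from an empty map.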
With a standalone ZooKeeper server, and even with a cluster, we have had 
sporadic reports of issues that were probably also related to stale ephemeral 
nodes persisting. However, I have been unable to reproduce the reported problem 
against a standalone ZooKeeper server of version 3.5.7 and have to conclude 
that at least the easily reproducible scenario described in this issue does not 
apply there, and that I have not yet found another such scenario.
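For context, a rough sketch of the kind of two-phase check such a reproduction 
attempt boils down to is below. This is not our actual test code; the connect 
string, session timeout and the /election/member-1 path are made-up 
placeholders:
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StaleEphemeralCheck {

    // Phase 1: register an ephemeral election node, then kill -9 both this
    // process and the ZooKeeper server, and restart the server.
    static void register(ZooKeeper zk) throws Exception {
        if (zk.exists("/election", false) == null) {
            zk.create("/election", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/election/member-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Phase 2: once the session timeout has long elapsed, connect with a
    // fresh client and check whether the node outlived its session.
    static void check(ZooKeeper zk) throws Exception {
        Stat stat = zk.exists("/election/member-1", false);
        System.out.println(stat == null
                ? "ephemeral node expired as expected"
                : "stale ephemeral node, owner session 0x"
                  + Long.toHexString(stat.getEphemeralOwner()));
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 4000, event -> { });
        if (args.length > 0 && "check".equals(args[0])) {
            check(zk);
        } else {
            register(zk);
            Thread.sleep(Long.MAX_VALUE); // hold the session until killed
        }
    }
}
{code}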

I am therefore sorry for having bothered you prematurely, as this seems to have 
been mainly a problem on our side. This issue can be closed; I will keep an eye 
on this type of problem in case we observe it again at some point!

Just some final words on your observations: the lines
{code:java}
Ignoring processTxn failure hdr: -1, error: -110, path: null{code}
and
{code:java}
EOF exception java.io.EOFException: Failed to read 
/my/path/version-2/log.72{code}
still show up on startup, but do not seem related to the fixed problem in our 
embedded ZooKeeper instance.
 The line
{code:java}
ZKShutdownHandler is not registered, so ZooKeeper server won't take any action 
on ERROR or SHUTDOWN server state changes{code}
shows up in the logs because of the verbose log level (DEBUG) and because a 
{{ZKShutdownHandler}} may not be registered when the user creates a 
{{ZooKeeperServer}} object outside of {{ZooKeeperServerMain.runFromConfig}}.
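For comparison, a server started through {{ZooKeeperServerMain.runFromConfig}} 
gets the shutdown handler wired up for it internally. A minimal sketch, with 
the client port and data directory as placeholders:
{code:java}
import org.apache.zookeeper.server.ServerConfig;
import org.apache.zookeeper.server.ZooKeeperServerMain;

public class EmbeddedViaMain {
    public static void main(String[] args) throws Exception {
        // Placeholder client port and data directory.
        ServerConfig config = new ServerConfig();
        config.parse(new String[] {"2181", "/my/path"});

        // runFromConfig registers the shutdown handler on the server it
        // creates, so the DEBUG line above is not logged for servers
        // started this way. Note that this call blocks until shutdown.
        new ZooKeeperServerMain().runFromConfig(config);
    }
}
{code}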
The errors observed under (3.), however, do indeed seem to have been related to 
the described problem and are now gone.

Thank you again!



> Ephemeral node not deleted after session is gone, then elected as leader
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3890
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3890
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.14, 3.5.7
>            Reporter: Lea Morschel
>            Priority: Major
>         Attachments: cmdline-feedback.txt, zkLogsAndSnapshots.tar.xz
>
>
> When a ZooKeeper client session disappears, the associated ephemeral node 
> that is used for leader election is occasionally not deleted and persists 
> (indefinitely, it seems).
>  A leader election process may select such a stale node to be the leader. In 
> a scenario where there is a redundant service that takes action when 
> acquiring leadership by means of a ZooKeeper election process, this leads to 
> none of the services being active when the stale ephemeral node is elected.
> One of the scenarios where such a stale ephemeral node is created can be 
> triggered by force-killing the ZooKeeper server ({{kill -9 <pid>}}) as well 
> as the client, which leads to the session being recreated after restarting 
> the server on its side, even though the actual client session is gone. This 
> node even persists after regular restarts from now on. No pings from its 
> owner-session are received, compared to an active one, yet the session never 
> expires. This scenario involves a single ZooKeeper server, but the problem 
> has also been observed in a cluster of three.
> When the ephemeral node is first persisted after restarting (and every 
> restart thereafter), the following is observable in the ZooKeeper server 
> logs. The scenario involves a local ZooKeeper server (version 3.5.7) and a 
> single leader election participant.
> {code:java}
> Opening datadir:/my/path snapDir:/my/path
> zookeeper.snapshot.trust.empty : true
> tickTime set to 2000
> minSessionTimeout set to 4000
> maxSessionTimeout set to 40000
> zookeeper.snapshotSizeFactor = 0.33
> Reading snapshot /my/path/version-2/snapshot.71
> Created new input stream /my/path/version-2/log.4b
> Created new input archive /my/path/version-2/log.4b
> EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.4b
> Created new input stream /my/path/version-2/log.72
> Created new input archive /my/path/version-2/log.72
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
> Snapshotting: 0x8b to /my/path/version-2/snapshot.8b
> ZKShutdownHandler is not registered, so ZooKeeper server won't take any 
> action on ERROR or SHUTDOWN server state changes
> autopurge.snapRetainCount set to 3
> autopurge.purgeInterval set to 3{code}
> Could this problem be solved by ZooKeeper checking the sessions for each 
> participating node before starting a leader election?
> So far only manual intervention (removing the stale ephemeral node) seems to 
> "fix" the issue temporarily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
