[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lea Morschel updated ZOOKEEPER-3890:
------------------------------------
    Description: 
When a ZooKeeper client session disappears, the ephemeral node it created for 
leader election is occasionally not deleted and persists (indefinitely, it 
seems).
As a consequence, the leader election repeatedly selects such a stale node as 
leader because it is the oldest, so none of the remaining redundant services, 
which only act once they acquire leadership, ever does.
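
For context, the election recipe in use follows the common ephemeral-sequential 
pattern, where the child with the lowest sequence number wins; the znode names 
below are illustrative, not taken from our deployment. A minimal sketch of the 
selection logic, showing why a leftover node keeps winning every round:
{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ElectionSketch {
    // Ephemeral sequential children such as "n_0000000042"; in the
    // usual recipe the lowest sequence number becomes the leader.
    static String leader(List<String> children) {
        return children.stream()
                .min(Comparator.comparingInt(ElectionSketch::seq))
                .orElse(null);
    }

    static int seq(String name) {
        return Integer.parseInt(name.substring(name.lastIndexOf('_') + 1));
    }

    public static void main(String[] args) {
        // A stale ephemeral left by a dead session keeps the lowest
        // sequence number, so it wins every election:
        List<String> children = Arrays.asList("n_0000000007", "n_0000000003");
        System.out.println(leader(children)); // prints n_0000000003
    }
}
{code}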

One of the scenarios that creates such a stale ephemeral node can be triggered 
by force-killing both the client and the ZooKeeper server ({{kill -9 <pid>}}): 
after the server is restarted, it recreates the session on its side even 
though the actual client session is gone. From then on the node persists even 
across regular restarts. No pings are received from this session, unlike from 
the active one, yet the session never expires. This scenario involves a single 
ZooKeeper server, but the problem has also been observed in a cluster of 
three.

When the ephemeral node is first persisted after restarting (and every restart 
thereafter), the following is observable in the ZooKeeper server logs:
{code:java}
Opening datadir:/my/path snapDir:/my/path
zookeeper.snapshot.trust.empty : true
tickTime set to 2000
minSessionTimeout set to 4000
maxSessionTimeout set to 40000
zookeeper.snapshotSizeFactor = 0.33
Reading snapshot /my/path/version-2/snapshot.71
Created new input stream /my/path/version-2/log.4b
Created new input archive /my/path/version-2/log.4b
EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.4b
Created new input stream /my/path/version-2/log.72
Created new input archive /my/path/version-2/log.72
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
Snapshotting: 0x8b to /my/path/version-2/snapshot.8b
ZKShutdownHandler is not registered, so ZooKeeper server won't take any action 
on ERROR or SHUTDOWN server state changes
autopurge.snapRetainCount set to 3
autopurge.purgeInterval set to 3{code}
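
For reference, the error code -110 in the "Ignoring processTxn failure" lines 
corresponds to NODEEXISTS in {{KeeperException.Code}} (values copied from the 
3.5.x source), i.e. the transaction log replay attempted to create nodes that 
already exist in the snapshot. A minimal lookup sketch:
{code:java}
import java.util.Map;

public class TxnErrors {
    // Subset of org.apache.zookeeper.KeeperException.Code values,
    // copied from the 3.5.x source, to decode the log lines above.
    static final Map<Integer, String> CODES = Map.of(
            -101, "NONODE",
            -110, "NODEEXISTS",
            -112, "SESSIONEXPIRED",
            -118, "SESSIONMOVED");

    public static void main(String[] args) {
        System.out.println(CODES.get(-110)); // prints NODEEXISTS
    }
}
{code}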

  was:
When a ZooKeeper client session disappears, the ephemeral node it created for 
leader election is occasionally not deleted and persists (indefinitely, it 
seems).
As a consequence, the leader election repeatedly selects such a stale node as 
leader because it is the oldest, so none of the remaining redundant services, 
which only act once they acquire leadership, ever does.

One of the scenarios that creates such a stale ephemeral node can be triggered 
by force-killing both the client and the ZooKeeper server ({{kill -9 <pid>}}): 
after the server is restarted, it recreates the session on its side even 
though the actual client session is gone. From then on the node persists even 
across regular restarts. This scenario involves a single ZooKeeper server, but 
the problem has also been observed in a cluster of three.

When the ephemeral node is first persisted after restarting (and every restart 
thereafter), the following is observable in the ZooKeeper server logs:
{code:java}
Opening datadir:/my/path snapDir:/my/path
zookeeper.snapshot.trust.empty : true
tickTime set to 2000
minSessionTimeout set to 4000
maxSessionTimeout set to 40000
zookeeper.snapshotSizeFactor = 0.33
Reading snapshot /my/path/version-2/snapshot.71
Created new input stream /my/path/version-2/log.4b
Created new input archive /my/path/version-2/log.4b
EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.4b
Created new input stream /my/path/version-2/log.72
Created new input archive /my/path/version-2/log.72
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
Ignoring processTxn failure hdr: -1 : error: -110
Ignoring processTxn failure hdr: -1, error: -110, path: null
EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
Snapshotting: 0x8b to /my/path/version-2/snapshot.8b
ZKShutdownHandler is not registered, so ZooKeeper server won't take any action 
on ERROR or SHUTDOWN server state changes
autopurge.snapRetainCount set to 3
autopurge.purgeInterval set to 3{code}


> Ephemeral node not deleted after session is gone, then elected as leader
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3890
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3890
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.7
>            Reporter: Lea Morschel
>            Priority: Major
>         Attachments: cmdline-feedback.txt
>
>
> When a ZooKeeper client session disappears, the ephemeral node it created 
> for leader election is occasionally not deleted and persists (indefinitely, 
> it seems).
> As a consequence, the leader election repeatedly selects such a stale node 
> as leader because it is the oldest, so none of the remaining redundant 
> services, which only act once they acquire leadership, ever does.
> One of the scenarios that creates such a stale ephemeral node can be 
> triggered by force-killing both the client and the ZooKeeper server ({{kill 
> -9 <pid>}}): after the server is restarted, it recreates the session on its 
> side even though the actual client session is gone. From then on the node 
> persists even across regular restarts. No pings are received from this 
> session, unlike from the active one, yet the session never expires. This 
> scenario involves a single ZooKeeper server, but the problem has also been 
> observed in a cluster of three.
> When the ephemeral node is first persisted after restarting (and every 
> restart thereafter), the following is observable in the ZooKeeper server logs:
> {code:java}
> Opening datadir:/my/path snapDir:/my/path
> zookeeper.snapshot.trust.empty : true
> tickTime set to 2000
> minSessionTimeout set to 4000
> maxSessionTimeout set to 40000
> zookeeper.snapshotSizeFactor = 0.33
> Reading snapshot /my/path/version-2/snapshot.71
> Created new input stream /my/path/version-2/log.4b
> Created new input archive /my/path/version-2/log.4b
> EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.4b
> Created new input stream /my/path/version-2/log.72
> Created new input archive /my/path/version-2/log.72
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
> Snapshotting: 0x8b to /my/path/version-2/snapshot.8b
> ZKShutdownHandler is not registered, so ZooKeeper server won't take any 
> action on ERROR or SHUTDOWN server state changes
> autopurge.snapRetainCount set to 3
> autopurge.purgeInterval set to 3{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
