[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akihiro Suda updated ZOOKEEPER-2162:
------------------------------------
    Attachment: ZOOKEEPER-2162.patch

A naive patch for ZOOKEEPER-2162.
It shuts down the server when the leader's epoch < the accepted epoch.
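
For illustration, the check described above could be sketched roughly as follows. This is not the attached patch itself: the class EpochCheckSketch, the Action enum, and decide() are hypothetical names, and the handling of the two non-failing branches is an assumption. Only the "shut down when the leader's epoch is lower" branch comes from the description.

```java
// Hypothetical sketch; none of these names are ZooKeeper APIs.
class EpochCheckSketch {
    // Possible outcomes when a server compares the leader's epoch with its own accepted epoch.
    enum Action { ADOPT_LEADER_EPOCH, KEEP_EPOCH, SHUT_DOWN }

    static Action decide(long leaderEpoch, long acceptedEpoch) {
        if (leaderEpoch > acceptedEpoch) {
            // Assumption: a newer leader epoch is simply accepted.
            return Action.ADOPT_LEADER_EPOCH;
        }
        if (leaderEpoch == acceptedEpoch) {
            // Assumption: nothing to update when the epochs already match.
            return Action.KEEP_EPOCH;
        }
        // leaderEpoch < acceptedEpoch: stop the server instead of retrying forever.
        return Action.SHUT_DOWN;
    }

    public static void main(String[] args) {
        System.out.println(decide(2, 1));
        System.out.println(decide(2, 2));
        System.out.println(decide(1, 2));
    }
}
```

The point of returning SHUT_DOWN rather than throwing and retrying is that the mismatch cannot resolve itself, so retrying only produces more log output.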


> infinite exception loop occurs when dataDir is lost
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2162
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>            Reporter: Akihiro Suda
>         Attachments: ZOOKEEPER-2162.patch
>
>
> The following sequence drives server.1 and server.2 into an infinite exception loop.
>  * Start server.1 and server.2 with the initial ensemble 
> server.1=participant, server.2=observer.
>    At this point, acceptedEpoch[i] == currentEpoch[i] == 1 for i = 1, 2.
>  * Invoke reconfig so that acceptedEpoch[i] and currentEpoch[i] increase 
> to 2.
>  * Kill server.2
>  * Remove the contents of server.2's dataDir, except the myid file.
>    (In real production environments, both confDir and dataDir can be lost 
> due to reprovisioning.)
>  * Start server.2
>  * server.1 and server.2 enter an infinite exception loop.
>    With the log threshold set to INFO in log4j.properties, the log size can 
> exceed 100MB in 30 seconds.
> AFAIK, the bug can be reproduced with 
> ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).
> I made a Docker container so that anyone interested can reproduce the 
> bug easily. (Sorry, there are no JUnit tests yet.)
> {noformat}
> $ docker run -i -t --rm akihirosuda/zookeeper-bug01
> Reproducing the bug: infinite exception loop occurs when dataDir is lost
> * Resetting
> * Starting [1,2] with the initial ensemble [1]
> * Sleeping for 3 seconds
> * Invoking Reconfig [1]->[2]
> * Sleeping for 3 seconds
> * Killing server.2 (pid=10542)
> * Sleeping for 3 seconds
> * Resetting /zk02_data
> * Starting server.2
> * Sleeping for 30 seconds
> /zk01_log: 81665114 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> /zk02_log: 23949367 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> * Exiting
> {noformat}
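
The size check that the container output reports could be approximated as below. This is a sketch only: the class LogDirSizeSketch, its methods, and the 10 MB threshold are assumptions, since the actual check and threshold used inside the container are not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch; not taken from the akihirosuda/zookeeper-bug01 image.
class LogDirSizeSketch {
    // Assumed threshold; the container's real threshold is unknown.
    static final long THRESHOLD_BYTES = 10L * 1024 * 1024;

    // Sums the sizes of all regular files under dir.
    static long dirSize(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when a log directory is large enough to suggest the exception loop.
    static boolean looksReproduced(long bytes) {
        return bytes > THRESHOLD_BYTES;
    }

    public static void main(String[] args) {
        for (String arg : args) {
            long size = dirSize(Path.of(arg));
            System.out.println(arg + ": " + size + " bytes");
            if (looksReproduced(size)) {
                System.out.println("The log dir is extremely large.");
            }
        }
    }
}
```

Running it with the two log directories as arguments (for example /zk01_log and /zk02_log) prints one size line per directory, in the same spirit as the output above.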



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
