[
https://issues.apache.org/jira/browse/ZOOKEEPER-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akihiro Suda updated ZOOKEEPER-2162:
------------------------------------
Description:
This sequence sends server.1 and server.2 into an infinite exception loop.
* Start server.1 and server.2 with the initial ensemble server.1=participant,
server.2=observer.
At this point, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
* Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] grow to
2.
* Kill server.2.
* Remove the dataDir of server.2, excluding the myid file.
(In real production environments, both confDir and dataDir can be lost
due to reprovisioning.)
* Start server.2.
* server.1 and server.2 enter an infinite exception loop.
The log size (with the threshold set to INFO in log4j.properties) can exceed
100MB in 30 seconds.
AFAIK, the bug can be reproduced with
ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).
I made a Docker container so that anyone interested can reproduce the
bug easily. (Sorry, there is no JUnit test yet.)
{noformat}
$ docker run -i -t --rm akihirosuda/zookeeper-bug01
Reproducing the bug: infinite exception loop occurs when dataDir is lost
* Resetting
* Starting [1,2] with the initial ensemble [1]
* Sleeping for 3 seconds
* Invoking Reconfig [1]->[2]
* Sleeping for 3 seconds
* Killing server.2 (pid=10542)
* Sleeping for 3 seconds
* Resetting /zk02_data
* Starting server.2
* Sleeping for 30 seconds
/zk01_log: 81665114 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
/zk02_log: 23949367 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
* Exiting
{noformat}
For details of the log, please refer to
https://github.com/AkihiroSuda/suda-pub/blob/master/dockerfiles/zookeeper-bug01/README.md
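The log-size check reported in the container output above could be sketched roughly as follows. This is not the container's actual script, only an illustration: the function names dir_size_bytes and looks_reproduced and the 10 MB threshold are assumptions, since the real cutoff is not stated in this report.

```python
import os

# Hypothetical cutoff; the container's real threshold is not given in this report.
THRESHOLD_BYTES = 10 * 1024 * 1024


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def looks_reproduced(path, threshold=THRESHOLD_BYTES):
    """Treat a log dir larger than threshold as a sign of the exception loop."""
    return dir_size_bytes(path) > threshold
```

Calling looks_reproduced on each server's log directory after the 30-second wait would give the same kind of signal as the "extremely large" messages in the transcript. A size check only suggests that the loop occurred; the log contents still need to be inspected to confirm it.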
> infinite exception loop occurs when dataDir is lost
> ---------------------------------------------------
>
> Key: ZOOKEEPER-2162
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0
> Reporter: Akihiro Suda
> Attachments: ZOOKEEPER-2162.patch
>