[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-1489:
------------------------------------

    Attachment: ZOOKEEPER-1489.patch

This update adds more logging, and I believe it also fixes the test failure we 
were seeing. The main issue was that the port forwarder was not attempting to 
reconnect on outbound connection failure. When a follower connects to the 
quorum port on the leader, it might take some time for the leader to open the 
port once the election has completed. The core follower code retries this 
connection. The port forwarder, however, would originally try once and close 
the inbound connection if the outbound connection failed. It now retries 10 times.
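
For readers without the patch at hand, the retry idea can be sketched roughly 
as below. This is only an illustration and not the code from the attached 
patch: the class name RetryingConnector, the method connectWithRetry, the 
connect timeout and the delay between attempts are all assumptions made for 
the example. Only the idea of retrying the outbound connection (10 times in 
the patch) comes from the description above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class RetryingConnector {
    /**
     * Hypothetical helper (not the actual port forwarder code): try to open
     * an outbound connection, retrying a fixed number of times on failure.
     */
    public static Socket connectWithRetry(String host, int port,
                                          int maxAttempts, long delayMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Socket s = new Socket();
            try {
                // 1 second connect timeout is an arbitrary choice for the sketch
                s.connect(new InetSocketAddress(host, port), 1000);
                return s; // connected; the caller owns the socket
            } catch (IOException e) {
                last = e;
                s.close();
                if (attempt < maxAttempts) {
                    // give the remote side (e.g. the leader) time to open the port
                    Thread.sleep(delayMs);
                }
            }
        }
        throw new IOException("giving up after " + maxAttempts + " attempts", last);
    }
}
```

A caller in the forwarder's position would use something like 
connectWithRetry(host, port, 10, 500) and close the inbound connection only 
if that call finally throws.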

I managed to find a box which would fail on this test (my laptop and a number 
of other boxes didn't display this issue). With this latest fix the previously 
failing box is now passing the test consistently.
                
> Data loss after truncate on transaction log
> -------------------------------------------
>
>                 Key: ZOOKEEPER-1489
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1489
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.3, 3.3.5
>         Environment: Tested on Ubuntu 12.04 and CentOS 6, should be 
> reproducible elsewhere
>            Reporter: Christian Ziech
>            Assignee: Patrick Hunt
>            Priority: Blocker
>             Fix For: 3.3.6, 3.4.4, 3.5.0
>
>         Attachments: TruncateTxLogCorruption.tgz, 
> TruncateTxLogCorruption.tgz, ZOOKEEPER-1489.patch, ZOOKEEPER-1489.patch, 
> ZOOKEEPER-1489.patch, ZOOKEEPER-1489.patch, ZOOKEEPER-1489_br33.patch, 
> ZOOKEEPER-1489_br33.patch, ZOOKEEPER-1489_br33.patch, 
> ZOOKEEPER-1489_br33.patch, ZOOKEEPER-1489_br34.patch, 
> ZOOKEEPER-1489_br34.patch, ZOOKEEPER-1489_br34.patch, 
> ZOOKEEPER-1489_br34.patch
>
>
> The truncate method on the transaction log in the class 
> org.apache.zookeeper.server.persistence.FileTxnLog will reduce the file size 
> to the required amount without either closing or re-positioning the logStream 
> (which could also be dangerous since the truncate method is not synchronized 
> against concurrent writes to the log).
> This causes the next append to that log to create a small "hole" in the file, 
> which Java interprets as binary zeroes when reading it. This in turn causes 
> the FileTxnIterator.next() implementation to detect the end of the log 
> file too early.
> I'll attach a small maven project with one junit test which can be used to 
> reproduce the issue. Due to the blackbox nature of the test it will run for 
> roughly 50 seconds unfortunately. 
> Steps to reproduce:
> - Start an ensemble of zookeeper servers with at least 3 participants
> - Create one entry and then remove one of the servers from the ensemble 
> temporarily (e.g. zk-2)
> - Create another entry which is hence only reflected on zk-1 and zk-3
> - Take zk-1 out of the ensemble without shutting it down (that is important, 
> I did that by interrupting the network connection to that node) and clean zk-3
> - Bring back zk-2 and zk-3 so that they form a quorum
> - Allow zk-1 to connect again
> - zk-1 will receive a TRUNC message from zk-2, since zk-1 is now in the 
> minority that knows about that second node creation event
> - Create a third node
> - Force zk-1 to become master somehow
> - That third node will be gone
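
The "hole" mentioned in the quoted description can be reproduced in isolation, 
without ZooKeeper. The sketch below is not taken from FileTxnLog; the class 
name TruncateHoleDemo and the byte values are made up for illustration. It 
assumes a file system with POSIX-like semantics, where an open stream keeps 
its own write position after another handle shrinks the file.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.util.Arrays;

public class TruncateHoleDemo {
    /**
     * Writes 8 bytes through a stream, truncates the file to 4 bytes through
     * a second handle, then appends 2 more bytes through the original stream.
     * Returns the resulting file contents.
     */
    public static byte[] run() throws IOException {
        File f = File.createTempFile("txnlog", ".bin");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
            out.flush();

            // Truncate via a separate handle; the stream is neither closed
            // nor re-positioned, so its write offset stays at 8.
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.setLength(4);
            }

            // The next write lands at the stale offset, past the new end of file.
            out.write(new byte[] {9, 9});
            out.flush();
        }
        byte[] data = Files.readAllBytes(f.toPath());
        f.delete();
        return data;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(run()));
    }
}
```

Under that assumption the file ends up 10 bytes long, with bytes 4 to 7 
reading back as zeroes between the retained data and the new write. A reader 
that treats zeroes as the end of the log would stop before the last two bytes.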

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
