Santeri (Santtu) Voutilainen created ZOOKEEPER-2099:
-------------------------------------------------------

             Summary: Using txnlog to sync a learner can corrupt the learner's 
datatree
                 Key: ZOOKEEPER-2099
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.0, 3.6.0
            Reporter: Santeri (Santtu) Voutilainen


When a learner sync's with the leader, it is possible for the Leader to send 
the learner a DIFF that does NOT contain all the transactions between the 
learner's zxid and that of the leader's zxid thus resulting in a corruption 
datatree on the learner.
For this to occur, the leader must have sync'd with a previous leader using a 
SNAP and the zxid requested by the learner must still exist in the current 
leader's txnlog files.
This issue was introduced by ZOOKEEPER-1413.

*Scenario*
A sample sequence in which this issue occurs:
# Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
partition, etc).  The last zxid on these hosts is Z1.
# Additional transactions occur on the cluster resulting in the latest zxid 
being Z2.
# Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
FOLLOWERINFO or OBSERVERINFO packet.
# The leader, H3, decides to send a SNAP because a) it does not have the 
necessary records in the in-mem committed log, AND b) the size of the required 
txnlog to send it larger than the limit.
# Host H1 successfully sync's with the leader (H3). At this point H1's txnlogs 
have records up to and including Z1 as well as Z2 and up.  It does NOT have 
records between Z1 and Z2.
# Host H3 fails; a leader election occurs and H1 is chosen as the leader
# Host H2 recovers and connects to H2 to sync and sends Z1 in its 
FOLLOWERINFO/OBSERVERINFO packet
# The leader, H1, determines it can send a DIFF.  It concludes this because 
although it does not have the necessary records in its in-memory commit log, it 
does have Z1 in its txnlog and the size of the log is less than the limit.  H1 
ends up with a different size calculation than H3 because H1 is missing all the 
records between Z1 and Z2 so it has less log to send.
# H2 receives the DIFF and applies the records to its data tree. Depending on 
the type of transactions that occurred between Z1 and Z2 it may not hit any 
errors when applying these records.

H2 now has a corrupted view of the data tree because it is missing all the 
changes made by the transactions between Z1 and Z2.

*Recovery*
The way to recover from this situation is to delete the data/snap directory 
contents from the affected hosts and have them resync with the leader at which 
point they will receive a SNAP since they will appear as empty hosts.

*Workaround*
A quick workaround for anyone concerned about this issue is to disable sync 
from the txnlog by changing the database size limit to 0.  This is a code 
change as it is not a configurable setting.

*Potential fixes*
There are several ways of fixing this.  A few of options:
* Delete all snaps and txnlog files on a host when it receives a SNAP from the 
leader
* Invalidate sync from txnlog after receiving a SNAP. This state must also be 
persisted on-disk so that the txnlogs with the gap cannot be used to provide a 
DIFF even after restart.  A couple ways in which the state could be persisted:
** Write a file (for example: loggap.<zxid>) in the data dir indicating that 
the host was sync'd with a SNAP and thus txnlogs might be missing. Presence of 
these files would be checked when reading txnlogs.
** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
marker. Readers of the txnlog would then check for presence of this record when 
iterating through it and act appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to