[
https://issues.apache.org/jira/browse/ZOOKEEPER-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306815#comment-15306815
]
Flavio Junqueira commented on ZOOKEEPER-1549:
---------------------------------------------
bq. "Snapshots are simply compacted versions of the txn log history, as applied
to the DataTree."
This is partially right and I acknowledge that my statement isn't that precise
either. The root of the problem is that sometimes we treat snapshots as having
only committed data (during the broadcast phase) and other times we produce
snapshots that have uncommitted data (during recovery). Whatever fix we adopt
must ensure that either snapshots contain committed data only or we enable
snapshots to be deleted. I don't personally like the idea of deleting snapshots
because if we don't get it right, then we will be making the zk state
inconsistent by losing quorum on some transactions. In fact, one scenario to be
aware of is the one in which an earlier snapshot has been committed while a
more recent one hasn't. If we wipe out the former snapshot, then we are in
trouble.
The one important thing to keep in mind is that we can't truncate a snapshot,
so either we have only committed state in snapshots so that we never fall into
this situation of having to truncate a snapshot or we start deleting snapshots
as a brute-force way of truncating, but in this latter case, we need to be
really careful.
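
To make the asymmetry concrete, here is a minimal sketch of why a log can be truncated but a snapshot cannot. It uses the two-digit toy zxids from the issue description. The class and method names are hypothetical and are not ZooKeeper APIs, and a snapshot is reduced to the last zxid it contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not ZooKeeper code: zxids are plain longs.
final class SnapshotTruncationSketch {
    /** A log can be cut back entry by entry: keep everything up to targetZxid. */
    static List<Long> truncateLog(List<Long> logZxids, long targetZxid) {
        List<Long> kept = new ArrayList<>();
        for (long zxid : logZxids) {
            if (zxid <= targetZxid) {
                kept.add(zxid);
            }
        }
        return kept;
    }

    /**
     * A snapshot is all-or-nothing, so the only option is to pick an older one.
     * Returns the newest snapshot zxid that is not beyond targetZxid, or -1 if
     * no such snapshot exists.
     */
    static long newestUsableSnapshot(List<Long> snapshotZxids, long targetZxid) {
        long best = -1;
        for (long zxid : snapshotZxids) {
            if (zxid <= targetZxid && zxid > best) {
                best = zxid;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Long> log = List.of(71L, 72L, 73L, 74L);
        List<Long> snapshots = List.of(72L, 74L); // the snapshot at 74 holds uncommitted data
        System.out.println(truncateLog(log, 73L));
        System.out.println(newestUsableSnapshot(snapshots, 73L));
    }
}
```

In this model, falling back from the snapshot at 74 to the one at 72 is the "delete snapshots as brute-force truncation" option. If snapshots only ever contained committed data, the fallback would never be needed.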
bq. One issue to account for in the fix is the case where there is no earlier
snapshot to rebuild from.
I'm not concerned about this case because we want the leader to transfer data
to the learner. I'm clearly assuming that if any earlier snapshot has been
committed by the same learner, then we aren't going to discard that snapshot.
bq. the snapshot on the learner node that a TRUNC would have deleted will still
be present on the learner node, but it will no longer be the newest snapshot.
If the learner has a newer valid snapshot, then it shouldn't be a problem.
However, if the learner has to load it, then it will have to treat it as if it
were the latest (which it probably is: the latest valid one).
bq. there were some concerns raised early in the comment thread that deleting
snapshots might be too aggressive
One important issue to keep in mind is that a quorum might commit a snapshot
and I'm concerned that if we start deleting snapshots, we will end up with bugs
where we delete snapshots incorrectly.
My ideal approach is:
# Snapshots only contain committed data
# In the case a leader needs to transfer a snapshot, then transfer the latest
snapshot containing the committed data and a suffix from the log.
However, this sounds difficult to do in a backward-compatible manner.
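
A rough sketch of the two-step approach above, assuming a toy data tree (a map from path to value) and a hypothetical `Txn` type that sets one path. None of these names come from ZooKeeper. The leader hands over a snapshot taken at a committed zxid plus the log entries after it, and the learner replays that suffix on top of the snapshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model, not ZooKeeper code.
final class SnapshotPlusSuffixSketch {
    /** Hypothetical transaction: set one path to one value. */
    static final class Txn {
        final long zxid;
        final String path;
        final String value;

        Txn(long zxid, String path, String value) {
            this.zxid = zxid;
            this.path = path;
            this.value = value;
        }
    }

    /** Leader side: the log entries newer than the snapshot, in log order. */
    static List<Txn> suffixAfter(List<Txn> log, long snapshotZxid) {
        List<Txn> suffix = new ArrayList<>();
        for (Txn txn : log) {
            if (txn.zxid > snapshotZxid) {
                suffix.add(txn);
            }
        }
        return suffix;
    }

    /** Learner side: start from the snapshot contents, then replay the suffix. */
    static Map<String, String> rebuild(Map<String, String> snapshot, List<Txn> suffix) {
        Map<String, String> tree = new TreeMap<>(snapshot);
        for (Txn txn : suffix) {
            tree.put(txn.path, txn.value);
        }
        return tree;
    }

    public static void main(String[] args) {
        List<Txn> log = List.of(
                new Txn(71, "/a", "1"), new Txn(72, "/b", "2"), new Txn(73, "/a", "3"));
        Map<String, String> snapshotAt72 = Map.of("/a", "1", "/b", "2");
        System.out.println(rebuild(snapshotAt72, suffixAfter(log, 72)));
    }
}
```

Because the snapshot in this sketch never contains anything beyond a committed zxid, the learner never has to undo part of it. Only the replayed suffix can need adjusting.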
> Data inconsistency when follower is receiving a DIFF with a dirty snapshot
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1549
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1549
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.3
> Reporter: Jacky007
> Assignee: Flavio Junqueira
> Priority: Blocker
> Fix For: 3.5.2, 3.6.0
>
> Attachments: ZOOKEEPER-1549-3.4.patch, ZOOKEEPER-1549-learner.patch,
> case.patch
>
>
> The trunc code (from ZOOKEEPER-1154?) cannot work correctly if the snapshot is
> not correct.
> Here is a scenario (similar to 1154):
> Initial Condition
> 1. Lets say there are three nodes in the ensemble A,B,C with A being the
> leader
> 2. The current epoch is 7.
> 3. For simplicity of the example, lets say zxid is a two digit number,
> with epoch being the first digit.
> 4. The zxid is 73
> 5. All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there
> is a crash of the entire ensemble and B,C never write the change 74 to their
> log.
> Step 2
> A,B restart, A is elected as the new leader, and A will load data and take a
> clean snapshot(change 74 is in it), then send diff to B, but B died before
> sync with A. A died later.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71,
> maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory
> data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff.
> Problem:
> The problem with the above sequence is that after truncating the log, A will
> load the snapshot again, which is not correct.
> In the 3.3 branch, FileTxnSnapLog.restore does not call the listener
> (ZOOKEEPER-874), so the leader will send a snapshot to the follower and this
> will not be a problem.
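
The comparison in Step 4 of the quoted description can be sketched as follows, using the same two-digit toy zxids. This is only an illustration of the range check described there, not the actual leader code, and all names are hypothetical. It shows that 74 falls numerically inside B's window [71, 81] even though B's own history never contained 74.

```java
import java.util.List;

// Hypothetical model of the Step 4 check, not ZooKeeper code.
final class DiffDecisionSketch {
    /** The range check from Step 4: is the peer's zxid inside the leader's window? */
    static boolean inCommitWindow(long peerZxid, long minCommitLog, long maxCommitLog) {
        return minCommitLog <= peerZxid && peerZxid <= maxCommitLog;
    }

    /** A stricter question: does the leader's own history actually contain that zxid? */
    static boolean leaderHasZxid(List<Long> leaderCommitted, long peerZxid) {
        return leaderCommitted.contains(peerZxid);
    }

    public static void main(String[] args) {
        List<Long> committedOnB = List.of(71L, 72L, 73L, 80L, 81L);
        long zxidFromA = 74L;
        System.out.println(inCommitWindow(zxidFromA, 71L, 81L));
        System.out.println(leaderHasZxid(committedOnB, zxidFromA));
    }
}
```

The gap between the two answers is the situation the description is concerned with: A's zxid looks in range, yet the change it refers to, which is already in A's snapshot, was never part of the quorum's history.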