[
https://issues.apache.org/jira/browse/ZOOKEEPER-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979267#action_12979267
]
Benjamin Reed commented on ZOOKEEPER-962:
-----------------------------------------
i gave reviewboard a try: https://reviews.apache.org/r/264/
there are a couple of minor points there.
one thing i'm not quite clear on is how moving to a read/write lock helps.
i think there is another problem, more closely related to 919, but since
both issues hit the same piece of code, it would probably be good to fix them
together. in the case of SNAP, we send a snapshot and then we send diffs to
that snapshot, but imagine the following: we start taking a snapshot, S, and
while we take the snapshot Z1, Z2, Z3, and Z4 come in; let's say that the
snapshot finishes right after Z4. the follower will get S, and then it will get
Z1, which it will log. if the follower fails and comes back up, it will
believe that it has the state up to Z1, which it does not in fact have.
to avoid this problem, i think we should receive S into memory, save off Z1-Z4
into a log in memory, committing them to S if we get a commit, and then when we
get an UPTODATE we write out the snapshot and Z1-Z4, and ack Z1-Z4. we can do
all this in syncWithLeader; not only will the issue i'm pointing out here be
fixed, but the concurrency issue that you have identified will also go away.
what do you think?
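a minimal sketch of the buffering idea above, in Java. everything in it is an assumption made for illustration: the class SnapSyncBuffer, its handler names, and the disk/acked lists are hypothetical stand-ins, not the real syncWithLeader code. it only shows the ordering: S and Z1-Z4 stay in memory, a commit is applied to the in-memory S, and nothing is written or acked until UPTODATE arrives.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical follower-side buffer for the SNAP sync path.
 * Nothing here is ZooKeeper's actual API; all names are illustrative.
 */
final class SnapSyncBuffer {
    private long snapshotZxid = -1;                          // zxid of S, held in memory only
    private final List<Long> proposals = new ArrayList<>();  // Z1..Zn in arrival order
    private final List<Long> committed = new ArrayList<>();  // proposals applied to in-memory S

    final List<String> disk = new ArrayList<>();  // stand-in for durable storage
    final List<Long> acked = new ArrayList<>();   // stand-in for ACKs sent to the leader

    /** Receive S into memory; nothing is written yet. */
    void onSnap(long zxid) {
        snapshotZxid = zxid;
    }

    /** Buffer a proposal in memory; no log write and no ACK yet. */
    void onProposal(long zxid) {
        proposals.add(zxid);
    }

    /** Apply a buffered proposal to the in-memory S. */
    void onCommit(long zxid) {
        if (proposals.contains(zxid)) {
            committed.add(zxid);
        }
    }

    /** Only now write out the snapshot, then the buffered proposals, then ACK them. */
    void onUpToDate() {
        disk.add("snapshot base=" + snapshotZxid + " applied=" + committed);
        for (long zxid : proposals) {
            disk.add("txn " + zxid);
            acked.add(zxid);
        }
    }

    public static void main(String[] args) {
        SnapSyncBuffer follower = new SnapSyncBuffer();
        follower.onSnap(100);
        for (long zxid = 101; zxid <= 104; zxid++) {
            follower.onProposal(zxid);  // Z1..Z4
        }
        follower.onCommit(101);
        System.out.println("durable before UPTODATE: " + follower.disk);
        follower.onUpToDate();
        System.out.println("durable after UPTODATE: " + follower.disk);
    }
}
```

in this model, a crash before onUpToDate leaves disk empty, so a restarted follower cannot believe it has state up to Z1.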
> leader/follower coherence issue when follower is receiving a DIFF
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-962
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-962
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.2
> Reporter: Camille Fournier
> Assignee: Camille Fournier
> Priority: Critical
> Fix For: 3.3.3, 3.4.0
>
> Attachments: ZOOKEEPER-962.patch
>
>
> From mailing list:
> It seems like we rely on the LearnerHandler thread startup to capture all of
> the missing committed transactions in the SNAP or DIFF, but I don't see
> anything (especially in the DIFF case) that is preventing us from committing
> more transactions before we actually start forwarding updates to the new
> follower.
> Let me explain using my example from ZOOKEEPER-919. Assume we have quorum
> already, so the leader can be processing transactions while my follower is
> starting up.
> I'm a follower at zxid N-5, the leader is at N. I send my FOLLOWERINFO packet
> to the leader with that information. The leader gets the proposals from its
> committed log (time T1), then syncs on the proposal list (LearnerHandler line
> 267. Why? It's a copy of the underlying proposal list... this might be part of
> our problem). I check to see if the peerLastZxid is within my max and min
> committed log and it is, so I'm going to send a diff. I set the zxidToSend to
> be the maxCommittedLog at time T3 (we already know this is sketchy), and
> forward the proposals from my copied proposal list starting at the
> peerLastZxid+1 up to the last proposal transaction (as seen at time T1).
> After I have queued up all those diffs to send, I tell the leader to
> startForwarding updates to this follower (line 308).
> So, let's say that at time T2 I actually swap out the leader to the thread
> that is handling the various request processors, and see that I got enough
> votes to commit zxid N+1. I commit N+1 and so my maxCommittedLog at T3 is N+1,
> but this proposal is not in the list of proposals that I got back at time T1,
> so I don't forward this diff to the follower. Additionally, I processed the
> commit and removed it from my leader's toBeApplied list. So when I call
> startForwarding for this new follower, I don't see this transaction as a
> transaction to be forwarded.
> That's one problem. Let's also imagine, however, that I commit N+1 at time
> T4. The maxCommittedLog value is consistent with the max of the diff packets I
> am going to send the follower. But I still committed N+1 and removed it from
> the toBeApplied list before calling startForwarding with this follower. How
> does the follower get this transaction? Does it?
> To put it another way, here is the thread interaction, hopefully formatted so
> you can read it...
> LearnerHandlerThread                  RequestProcessorThread
> T1(LH): get list of proposals (COPY)
>                                       T2(RPT): commit N+1, remove from toBeApplied
> T3(LH): get maxCommittedLog
> T4(LH): send diffs from view at T1
> T5(LH): startForwarding
> Or
> LearnerHandlerThread                  RequestProcessorThread
> T1(LH): get list of proposals (COPY)
> T2(LH): get maxCommittedLog
>                                       T3(RPT): commit N+1, remove from toBeApplied
> T4(LH): send diffs from view at T1
> T5(LH): startForwarding
> I'm trying to figure out what, if anything, keeps the requests from being
> committed, removed, and never seen by the follower before it fully starts up.
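The first interleaving in the quoted description can be replayed deterministically in a single thread, which makes the lost transaction easy to see. The sketch below is a simplified model under stated assumptions: DiffRaceReplay and its methods are hypothetical names that echo the description (committedLog, toBeApplied, startForwarding), not the real Leader or LearnerHandler classes, and zxids are modelled as plain consecutive longs.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Single-threaded replay of the first interleaving from the report.
 * Hypothetical model only: names echo the description, not ZooKeeper's classes.
 */
final class DiffRaceReplay {
    final List<Long> committedLog = new ArrayList<>();  // committed proposals, oldest first
    final List<Long> toBeApplied = new ArrayList<>();   // proposed, not yet committed

    /** RPT step: commit zxid and remove it from toBeApplied. */
    void commit(long zxid) {
        toBeApplied.remove(Long.valueOf(zxid));
        committedLog.add(zxid);
    }

    /** LH step T1: COPY of the committed proposals after the follower's last zxid. */
    List<Long> copyProposalsAfter(long peerLastZxid) {
        List<Long> copy = new ArrayList<>();
        for (long zxid : committedLog) {
            if (zxid > peerLastZxid) {
                copy.add(zxid);
            }
        }
        return copy;
    }

    /** LH step T5: startForwarding only sees what is still in toBeApplied. */
    List<Long> startForwarding() {
        return new ArrayList<>(toBeApplied);
    }

    /** Replays T1..T5 and returns the committed zxids the follower never receives. */
    static List<Long> replay(long n) {
        DiffRaceReplay leader = new DiffRaceReplay();
        for (long zxid = n - 9; zxid <= n; zxid++) {
            leader.committedLog.add(zxid);
        }
        leader.toBeApplied.add(n + 1);                        // N+1 proposed, awaiting votes

        List<Long> diffs = leader.copyProposalsAfter(n - 5);  // T1 (LH)
        leader.commit(n + 1);                                 // T2 (RPT)
        long maxCommittedLog =
                leader.committedLog.get(leader.committedLog.size() - 1);  // T3 (LH)
        List<Long> received = new ArrayList<>(diffs);         // T4 (LH): view from T1
        received.addAll(leader.startForwarding());            // T5 (LH)

        List<Long> missed = new ArrayList<>();
        for (long zxid = n - 4; zxid <= maxCommittedLog; zxid++) {
            if (!received.contains(zxid)) {
                missed.add(zxid);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        System.out.println("missed zxids: " + replay(100));
    }
}
```

With N = 100 the replay reports zxid 101 (N+1) as missed: it was committed after the copy at T1 and removed from toBeApplied before T5, so neither path delivers it. Any fix has to make the copy, the maxCommittedLog read, and the startForwarding registration atomic with respect to commits.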