Thanks for the explanation! But how do you avoid the scenario raised by John? Let's say you're a client connected to F, and F is connected to L. Let's also say that L's pipeline is now empty, and both F and L are partitioned from the 3 other servers in the system, which have already elected a new leader L'. Now I go to L' and write something. L still thinks it's the leader, because detecting that its followers have left depends on a timeout. So when F sends your sync to L and L returns it to F, you actually miss my write!
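To make the scenario concrete, here is a toy Python sketch. It is not ZooKeeper code: the `Leader` and `Follower` classes and their methods are hypothetical, and the only behavior modeled is the one described in this thread, namely that sync flushes the leader-to-follower channel without any quorum check.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Toy leader: a log of writes plus a queue of updates per follower."""
    name: str
    log: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)  # follower name -> queued updates

    def write(self, value, reachable_followers):
        self.log.append(value)
        for follower in reachable_followers:
            self.pending.setdefault(follower.name, []).append(value)

    def sync(self, follower):
        # Assumption taken from the thread: no agreement round and no quorum
        # check. The leader only hands over what it has queued for this follower.
        return self.pending.pop(follower.name, [])


@dataclass
class Follower:
    name: str
    leader: Leader
    state: list = field(default_factory=list)

    def sync_then_read(self):
        # Apply everything the leader flushed, then read local state.
        for update in self.leader.sync(self):
            self.state.append(update)
        return list(self.state)


old_leader = Leader("L")
f = Follower("F", old_leader)
old_leader.write("a", [f])
f.sync_then_read()  # drains the pipeline, so L has nothing pending for F

# Partition: F and L are cut off, and the other 3 servers elect L'.
new_leader = Leader("L'", log=list(old_leader.log))
new_leader.write("b", [])  # F is unreachable from L'

seen = f.sync_then_read()  # L still answers the sync
print(seen)            # ['a']
print(new_leader.log)  # ['a', 'b']
```

The read after sync returns only "a", although "b" was already written at L' before the sync started.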
Alex

On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
> Hi Alex, Because of the following:
>
> 1- A follower F processes operations from a client in FIFO order, and say
> that a client submits as you say sync + read;
> 2- A sync will be processed by the leader and returned to the follower. It
> will be queued after all pending updates that the follower hasn't processed;
> 3- The follower will process all pending updates before processing the
> response of the sync;
> 4- Once the follower processes the sync, it picks the read operation to
> process. It reads the local state of the follower and returns to the client.
>
> When we process the read in Step 4, we have applied all pending updates the
> leader had for the follower by the time the read request started.
>
> This implementation is a bit of a hack because it doesn't follow the same
> code path as the other operations that go to the leader, but it avoids some
> unnecessary steps, which is important for fast reads. In the sync case, the
> other followers don't really need to know about it (there is nothing to be
> updated) and the leader simply inserts it in the sequence of updates of F,
> ordering it.
>
> -Flavio
>
> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>
>> Hi Flavio,
>>
>>> Starting a read operation concurrently with a sync implies that the result
>>> of the read will not miss an update committed before the read started.
>>
>> I thought that the intention of sync was to give something like
>> linearizable reads, so if you invoke a sync and then a read, your read
>> is guaranteed to (at least) see any write which completed before the
>> sync began. Is this the intention ? If so, how is this achieved
>> without running agreement on the sync op ?
>>
>> Thanks,
>> Alex
>>
>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]>
>> wrote:
>>> sync simply flushes the channel between the leader and the follower that
>>> forwarded the sync operation, so it doesn't go through the full zab
>>> pipeline. Flushing means that all pending updates from the leader to the
>>> follower are received by the time sync completes. Starting a read operation
>>> concurrently with a sync implies that the result of the read will not miss
>>> an update committed before the read started.
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>
>>>> It's strange that sync doesn't run through agreement, I was always
>>>> assuming that it does... Exactly for the reason you say -
>>>> you may trust your leader, but I may have a different leader and your
>>>> leader may not detect it yet and still think it's the leader.
>>>>
>>>> This seems like a bug to me.
>>>>
>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or shouldn't)
>>>> depend on timing assumptions.
>>>> Only progress guarantees depend on time.
>>>>
>>>> Alex
>>>>
>>>>
>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]>
>>>> wrote:
>>>>> I have some pretty strong requirements in terms of consistency where
>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>> my use case.
>>>>>
>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>> off from the network. A new leader is elected, but the follower and old
>>>>> leader don't know about it.
>>>>>
>>>>> Normally I think sync was made for this purpose, but I looked at the sync
>>>>> code and if there aren't any outstanding proposals the leader sends the
>>>>> sync right back to the client without first verifying that it still has
>>>>> quorum, so this won't work for my use case.
>>>>>
>>>>> At the core of the issue all I really need is a call that will make its
>>>>> way to the leader and will ping its followers, ensure it still has a
>>>>> quorum and return success.
>>>>>
>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>> leader, leader will ensure it still has quorum and return its epoch. I
>>>>> can use this primitive to implement all the other properties I want to
>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>> this call returns). Also the nice thing about this method is that it will
>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>> followers.
>>>>>
>>>>> Most of the guarantees offered by zookeeper are time based and rely on
>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>> time is broken.
>>>>>
>>>>> Also if people are interested I can go into more detail about what I am
>>>>> trying to write.
>>>>>
>>>>> -jc
>>>
>
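
For reference, the getCurrentLeaderEpoch() idea from John's original message could look roughly like the Python sketch below. This is only an illustration of the proposal, not an existing ZooKeeper call: the function name, the `NoQuorumError` exception, and the `ping_followers` callable are all hypothetical, and the sketch assumes a ping returns the epoch each reachable follower currently accepts.

```python
class NoQuorumError(Exception):
    """Raised when the leader cannot confirm a majority at its epoch."""


def get_current_leader_epoch(leader_epoch, ensemble_size, ping_followers):
    """Return leader_epoch only if a majority of the ensemble acknowledges it.

    ping_followers is a hypothetical callable returning the epochs reported
    by the followers that answered; unreachable followers are simply absent.
    """
    acks = 1  # the leader counts itself
    for epoch in ping_followers():
        if epoch == leader_epoch:
            acks += 1
    if acks <= ensemble_size // 2:
        raise NoQuorumError(
            f"only {acks} of {ensemble_size} servers at epoch {leader_epoch}"
        )
    return leader_epoch


# Healthy 5-server ensemble: the leader plus 3 followers answer at epoch 7.
healthy = get_current_leader_epoch(7, 5, lambda: [7, 7, 7])

# Partitioned old leader: only F answers, and 2 of 5 is not a majority.
try:
    get_current_leader_epoch(7, 5, lambda: [7])
    stale_detected = False
except NoQuorumError:
    stale_detected = True
```

The check needs one round trip to the followers and no disk write, which matches the latency argument in the proposal. Unlike the sync path discussed above, a leader cut off from the majority fails the call instead of answering from stale state.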
