Thanks for the explanation. I guess one could always invoke a write operation instead of sync to get the stricter semantics, but, as John suggests, it might be a good idea to add a new type of operation that requires followers to ack but doesn't require them to log to disk - this seems sufficient in our case.
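As a rough illustration of such an operation, the leader-side check might look like the sketch below. This is a hypothetical simplification - the class and method names are invented, not ZooKeeper internals: the leader succeeds only if a majority of the ensemble (itself included) acks, and nothing touches the transaction log, so the cost is one round trip to the followers.

```java
import java.util.List;

// Hypothetical sketch (not ZooKeeper code) of the proposed operation: the
// leader asks its followers to ack and succeeds only if a majority of the
// ensemble (leader included) responds. Nothing is logged to disk.
class QuorumCheck {
    interface Follower {
        boolean ping(); // true if this follower still acks the leader's epoch
    }

    // ensembleSize counts every server, the leader included.
    static boolean leaderStillHasQuorum(List<Follower> followers, int ensembleSize) {
        int acks = 1; // the leader counts itself
        for (Follower f : followers) {
            if (f.ping()) {
                acks++;
            }
        }
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // 5-server ensemble: leader + 4 followers, two partitioned away.
        List<Follower> mostlyUp = List.<Follower>of(
                () -> true, () -> true, () -> false, () -> false);
        System.out.println(leaderStillHasQuorum(mostlyUp, 5)); // 3 of 5 ack -> true

        // After the partition grows, only one follower answers, so a stale
        // leader fails the check instead of answering from stale state.
        List<Follower> partitioned = List.<Follower>of(
                () -> true, () -> false, () -> false, () -> false);
        System.out.println(leaderStillHasQuorum(partitioned, 5)); // 2 of 5 -> false
    }
}
```

The second case is the point of the proposal: a leader cut off from a quorum fails the round trip rather than replying, which plain sync does not guarantee.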
Alex

On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
> In theory, the scenario you're describing could happen, but I would argue
> that it is unlikely given that: 1) a leader pings followers twice a tick to
> make sure that it has a quorum of supporters (lead()); 2) followers give up
> on a leader upon catching an exception (followLeader()). One could calibrate
> tickTime to make the probability of this scenario low.
>
> Let me also revisit the motivation for the way we designed sync. ZooKeeper
> has been designed to serve reads efficiently, and making sync go through the
> pipeline would slow down reads. Although optional, we thought it would be a
> good idea to make it as efficient as possible to comply with the original
> expectations for the service. We consequently came up with this cheap way of
> making sure that a read sees all pending updates. It is correct that there
> are some corner cases that it doesn't cover. One is the case you mentioned.
> Another is having the sync finish before the client submits the read, with a
> write committing in between. We rely upon the way we implement timeouts and
> some minimum degree of synchrony for the clients when submitting operations
> to guarantee that the scheme works.
>
> We thought about the option of having the sync operation go through the
> pipeline, and in fact it would have been easier to implement it just as a
> regular write, but we opted not to because we felt it was sufficient for the
> use cases we had and more efficient, as I already argued.
>
> Hope it helps to clarify.
>
> -Flavio
>
> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>
>> Thanks for the explanation! But how do you avoid the scenario raised by
>> John? Let's say you're a client connected to F, and F is connected to L.
>> Let's also say that L's pipeline is now empty, and both F and L are
>> partitioned from 3 other servers in the system that have already
>> elected a new leader L'. Now I go to L' and write something. L still
>> thinks it's the leader because the detection that followers left it is
>> obviously timeout dependent. So when F sends your sync to L and L
>> returns it to F, you actually miss my write!
>>
>> Alex
>>
>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]>
>> wrote:
>>> Hi Alex, Because of the following:
>>>
>>> 1- A follower F processes operations from a client in FIFO order, and say
>>> that a client submits, as you say, sync + read;
>>> 2- A sync will be processed by the leader and returned to the follower. It
>>> will be queued after all pending updates that the follower hasn't processed;
>>> 3- The follower will process all pending updates before processing the
>>> response of the sync;
>>> 4- Once the follower processes the sync, it picks the read operation to
>>> process. It reads the local state of the follower and returns to the client.
>>>
>>> When we process the read in Step 4, we have applied all pending updates the
>>> leader had for the follower by the time the read request started.
>>>
>>> This implementation is a bit of a hack because it doesn't follow the same
>>> code path as the other operations that go to the leader, but it avoids some
>>> unnecessary steps, which is important for fast reads. In the sync case, the
>>> other followers don't really need to know about it (there is nothing to be
>>> updated), and the leader simply inserts it in the sequence of updates of F,
>>> ordering it.
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>>> Starting a read operation concurrently with a sync implies that the
>>>>> result of the read will not miss an update committed before the read
>>>>> started.
>>>>
>>>> I thought that the intention of sync was to give something like
>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>> is guaranteed to (at least) see any write which completed before the
>>>> sync began. Is this the intention? If so, how is this achieved
>>>> without running agreement on the sync op?
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]>
>>>> wrote:
>>>>> sync simply flushes the channel between the leader and the follower that
>>>>> forwarded the sync operation, so it doesn't go through the full Zab
>>>>> pipeline. Flushing means that all pending updates from the leader to the
>>>>> follower are received by the time sync completes. Starting a read
>>>>> operation concurrently with a sync implies that the result of the read
>>>>> will not miss an update committed before the read started.
>>>>>
>>>>> -Flavio
>>>>>
>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>
>>>>>> It's strange that sync doesn't run through agreement; I was always
>>>>>> assuming that it does... exactly for the reason you say -
>>>>>> you may trust your leader, but I may have a different leader, and your
>>>>>> leader may not detect it yet and still think it's the leader.
>>>>>>
>>>>>> This seems like a bug to me.
>>>>>>
>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>>>> depend on timing assumptions.
>>>>>> Only progress guarantees depend on time.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]>
>>>>>> wrote:
>>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>>> reading from followers that may be behind in terms of updates isn't ok
>>>>>>> for my use case.
>>>>>>>
>>>>>>> One error case that worries me is if a follower and leader are
>>>>>>> partitioned off from the network.
>>>>>>> A new leader is elected, but the follower and old leader don't know
>>>>>>> about it.
>>>>>>>
>>>>>>> Normally I think sync was made for this purpose, but I looked at the
>>>>>>> sync code, and if there aren't any outstanding proposals, the leader
>>>>>>> sends the sync right back to the client without first verifying that
>>>>>>> it still has quorum, so this won't work for my use case.
>>>>>>>
>>>>>>> At the core of the issue, all I really need is a call that will make
>>>>>>> its way to the leader, ping its followers, ensure it still has a
>>>>>>> quorum, and return success.
>>>>>>>
>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to
>>>>>>> the leader; the leader will ensure it still has quorum and return its
>>>>>>> epoch. I can use this primitive to implement all the other properties
>>>>>>> I want to verify (assuming that my client will never connect to an
>>>>>>> older epoch after this call returns). Also, the nice thing about this
>>>>>>> method is that it will not have to hit disk, and the latency should
>>>>>>> just be a round trip to the followers.
>>>>>>>
>>>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on
>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>> spite of busted clocks, horrible GC performance, VM suspends, and any
>>>>>>> other way time is broken.
>>>>>>>
>>>>>>> Also, if people are interested, I can go into more detail about what I
>>>>>>> am trying to write.
>>>>>>>
>>>>>>> -jc
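For readers following the thread, Flavio's four-step flush argument can be modeled with a single FIFO queue. This is a deliberately simplified, hypothetical sketch (the types and names are invented, not ZooKeeper internals); it only shows why queuing the sync response behind pending updates is enough when F is still talking to the real leader:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not ZooKeeper internals) of the flush-based sync:
// the leader queues the sync response *behind* every update still pending
// for follower F, and F drains its channel in FIFO order, so a read queued
// after the sync sees all updates committed before the sync started.
class SyncFlushModel {
    interface Msg {}
    record Commit(String key, String value) implements Msg {}
    record SyncResponse() implements Msg {}

    // F applies pending commits in order, then serves the read when the
    // sync response arrives; returns what the read observed for "/x".
    static String syncThenRead(Deque<Msg> channelFromLeader) {
        Map<String, String> localState = new HashMap<>();
        String readResult = null;
        while (!channelFromLeader.isEmpty()) {
            Msg m = channelFromLeader.poll();
            if (m instanceof Commit c) {
                localState.put(c.key(), c.value());
            } else if (m instanceof SyncResponse) {
                readResult = localState.get("/x");
            }
        }
        return readResult;
    }

    public static void main(String[] args) {
        Deque<Msg> channel = new ArrayDeque<>();
        channel.add(new Commit("/x", "1")); // pending when the sync arrives
        channel.add(new Commit("/x", "2")); // pending when the sync arrives
        channel.add(new SyncResponse());    // leader appends sync response last
        System.out.println(syncThenRead(channel)); // prints "2"
    }
}
```

Note that this argument silently assumes F's leader is still the real leader: a partitioned stale leader has an empty pending queue and acks the sync immediately, which is exactly the corner case John and Alex raise above.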
