Thanks for the explanation.

I guess one could always invoke a write operation instead of sync to
get the more strict semantics, but as John suggests, it might be a
good idea to add a new type of operation that requires followers to
ack but doesn't require them to log to disk - this seems sufficient in
our case.
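
To make the idea concrete, here is a minimal sketch in Java of what such
an operation could look like. It is not ZooKeeper code: QuorumAckSketch,
Follower, ackEpoch and confirmLeadership are all hypothetical names, and
the followers are plain in-memory objects rather than network peers. The
leader asks every other server to ack its epoch (no disk write) and only
answers if a majority of the ensemble, counting itself, still supports it.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical sketch, not ZooKeeper's implementation: all names are invented.
final class QuorumAckSketch {

    // A follower acks only if it still follows a leader with this epoch.
    // The ack is served from memory; nothing is logged to disk.
    // An unreachable follower is modeled as one that returns false.
    interface Follower {
        boolean ackEpoch(long leaderEpoch);
    }

    // Returns the leader's epoch if a majority of the ensemble
    // (the leader itself plus the acking followers) still supports it,
    // or empty if the leader can no longer prove it has a quorum.
    static OptionalLong confirmLeadership(long leaderEpoch, List<Follower> followers) {
        int ensembleSize = followers.size() + 1; // all other servers plus the leader
        int acks = 1;                            // the leader supports itself
        for (Follower f : followers) {
            if (f.ackEpoch(leaderEpoch)) {
                acks++;
            }
        }
        boolean hasQuorum = acks > ensembleSize / 2;
        return hasQuorum ? OptionalLong.of(leaderEpoch) : OptionalLong.empty();
    }
}
```

In the partition scenario discussed below (L and F cut off from three
other servers), L would collect only two acks out of five and return
empty, so a sync routed through such a check could not complete against
a stale leader.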

Alex

On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
> In theory, the scenario you're describing could happen, but I would argue 
> that it is unlikely given that: 1) a leader pings followers twice a tick to 
> make sure that it has a quorum of supporters (lead()); 2) followers give up 
> on a leader upon catching an exception (followLeader()). One could calibrate 
> tickTime to make the probability of having this scenario low.
>
> Let me also revisit the motivation for the way we designed sync. ZooKeeper 
> has been designed to serve reads efficiently and making sync go through the 
> pipeline would slow down reads. Although optional, we thought it would be a 
> good idea to make it as efficient as possible to comply with the original 
> expectations for the service. We consequently came up with this cheap way of 
> making sure that a read sees all pending updates. It is correct that there 
> are some corner cases that it doesn't cover. One is the case you mentioned. 
> Another is the sync finishing before the client submits the read, with a 
> write committing in between. We rely upon the way we implement timeouts and 
> on some minimum degree of synchrony for the clients when submitting 
> operations to guarantee that the scheme works.
>
> We thought about the option of having the sync operation going through the 
> pipeline, and in fact it would have been easier to implement it just as a 
> regular write, but we opted not to because we felt it was sufficient for the 
> use cases we had and more efficient as I already argued.
>
> Hope it helps to clarify.
>
> -Flavio
>
> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>
>> Thanks for the explanation! But how do you avoid the scenario raised by
>> John? Let's say you're a client connected to F, and F is connected to L.
>> Let's also say that L's pipeline is now empty, and both F and L are
>> partitioned from 3 other servers in the system that have already elected
>> a new leader L'. Now I go to L' and write something. L still thinks it's
>> the leader because the detection that followers left it is obviously
>> timeout dependent. So when F sends your sync to L and L returns it to F,
>> you actually miss my write!
>>
>> Alex
>>
>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> 
>> wrote:
>>> Hi Alex, because of the following:
>>>
>>> 1- A follower F processes operations from a client in FIFO order, and say 
>>> that a client submits as you say sync + read;
>>> 2- A sync will be processed by the leader and returned to the follower. It 
>>> will be queued after all pending updates that the follower hasn't processed;
>>> 3- The follower will process all pending updates before processing the 
>>> response of the sync;
>>> 4- Once the follower processes the sync, it picks the read operation to 
>>> process. It reads the local state of the follower and returns to the client.
>>>
>>> When we process the read in Step 4, we have applied all pending updates the 
>>> leader had for the follower by the time the read request started.
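
For reference, steps 1-4 above can be sketched roughly as follows in Java.
This is a toy model, not the actual follower code: SyncFlushSketch,
leaderCommits, syncThenRead and readWithoutSync are hypothetical names, and
the leader-to-follower channel is modeled as a single in-memory FIFO queue.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of the sync path described above; all names are hypothetical.
final class SyncFlushSketch {

    // A message on the leader-to-follower channel: either a committed
    // update (key and value set) or the reply to a sync (key == null).
    static final class Message {
        final String key;
        final String value;
        Message(String key, String value) { this.key = key; this.value = value; }
        static Message commit(String key, String value) { return new Message(key, value); }
        static Message syncReply() { return new Message(null, null); }
        boolean isSyncReply() { return key == null; }
    }

    // FIFO channel from the leader to this follower.
    private final Queue<Message> fromLeader = new ArrayDeque<>();
    // The follower's local copy of the data, which reads are served from.
    private final Map<String, String> localState = new HashMap<>();

    // Leader side: a committed update is queued for the follower.
    void leaderCommits(String key, String value) {
        fromLeader.add(Message.commit(key, value));
    }

    // Client side: sync followed by read, processed in FIFO order.
    String syncThenRead(String key) {
        // Step 2: the leader queues the sync reply after all pending updates.
        fromLeader.add(Message.syncReply());
        // Step 3: the follower applies every update that precedes the reply.
        Message m;
        while (!(m = fromLeader.remove()).isSyncReply()) {
            localState.put(m.key, m.value);
        }
        // Step 4: the read is served from the follower's local state.
        return localState.get(key);
    }

    // A plain read without sync sees only what has been applied so far.
    String readWithoutSync(String key) {
        return localState.get(key);
    }
}
```

The sketch only shows the flush. It says nothing about whether the leader
that answered the sync is still the leader, which is the case John raised.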
>>>
>>> This implementation is a bit of a hack because it doesn't follow the same 
>>> code path as the other operations that go to the leader, but it avoids some 
>>> unnecessary steps, which is important for fast reads. In the sync case, the 
>>> other followers don't really need to know about it (there is nothing to be 
>>> updated) and the leader simply inserts it in the sequence of updates of F, 
>>> ordering it.
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>>> Starting a read operation concurrently with a sync implies that the 
>>>>> result of the read will not miss an update committed before the read 
>>>>> started.
>>>>
>>>> I thought that the intention of sync was to give something like
>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>> is guaranteed to (at least) see any write which completed before the
>>>> sync began. Is this the intention ? If so, how is this achieved
>>>> without running agreement on the sync op ?
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> 
>>>> wrote:
>>>>> sync simply flushes the channel between the leader and the follower that 
>>>>> forwarded the sync operation, so it doesn't go through the full zab 
>>>>> pipeline. Flushing means that all pending updates from the leader to the 
>>>>> follower are received by the time sync completes. Starting a read 
>>>>> operation concurrently with a sync implies that the result of the read 
>>>>> will not miss an update committed before the read started.
>>>>>
>>>>> -Flavio
>>>>>
>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>
>>>>>> It's strange that sync doesn't run through agreement; I had always
>>>>>> assumed that it does... Exactly for the reason you say -
>>>>>> you may trust your leader, but I may have a different leader and your
>>>>>> leader may not detect it yet and still think it's the leader.
>>>>>>
>>>>>> This seems like a bug to me.
>>>>>>
>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>>>> depend on timing assumptions.
>>>>>> Only progress guarantees depend on time.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> 
>>>>>> wrote:
>>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>>> reading from followers that may be behind in terms of updates isn't ok
>>>>>>> for my use case.
>>>>>>>
>>>>>>> One error case that worries me is if a follower and leader are
>>>>>>> partitioned off from the network.  A new leader is elected, but the
>>>>>>> follower and old leader don't know about it.
>>>>>>>
>>>>>>> Normally I think sync was made for this purpose, but I looked at the
>>>>>>> sync code and if there aren't any outstanding proposals the leader
>>>>>>> sends the sync right back to the client without first verifying that
>>>>>>> it still has quorum, so this won't work for my use case.
>>>>>>>
>>>>>>> At the core of the issue all I really need is a call that will make its
>>>>>>> way to the leader and will ping its followers, ensure it still has a
>>>>>>> quorum and return success.
>>>>>>>
>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>>>> leader; the leader will ensure it still has quorum and return its epoch.
>>>>>>> I can use this primitive to implement all the other properties I want to
>>>>>>> verify (assuming that my client will never connect to an older epoch
>>>>>>> after this call returns). Also the nice thing about this method is that
>>>>>>> it will not have to hit disk and the latency should just be a round trip
>>>>>>> to the followers.
>>>>>>>
>>>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on
>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>>>> time is broken.
>>>>>>>
>>>>>>> Also if people are interested I can go into more detail about what I am
>>>>>>> trying to write.
>>>>>>>
>>>>>>> -jc
>>>>>
>>>
>
