Thanks for the explanation! But how do you avoid the scenario
raised by John?

Let's say you're a client connected to F, and F is connected to L. Let's
also say that L's pipeline is now empty, and both F and L are partitioned
from the three other servers in the system, which have already elected a
new leader L'. Now I go to L' and write something. L still thinks it's
the leader, because detecting that its followers have left is necessarily
timeout dependent. So when F sends your sync to L and L returns it to F,
you actually miss my write!
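The scenario above can be sketched as a toy simulation. This is a hypothetical model, not ZooKeeper code: it only illustrates why a sync that merely flushes the leader-to-follower channel, without a quorum check, can miss a write committed at the new leader.

```python
# Toy model of the partition scenario: old leader L and follower F are
# cut off from the three servers that elected a new leader L'.
# This is an illustration, not ZooKeeper code.

class Server:
    def __init__(self, name):
        self.name = name
        self.log = []      # committed writes, in order
        self.pending = {}  # per-follower queue of updates not yet sent

L = Server("L")            # stale leader
F = Server("F")            # follower connected to L
L_prime = Server("L'")     # new leader on the other side of the partition

# My write is committed by the new quorum at L'.
L_prime.log.append("x=1")

def sync(leader, follower):
    # Flush pending updates from this leader to this follower.
    # L's pipeline toward F is empty, so nothing is flushed and the
    # sync "succeeds" immediately -- no quorum is consulted.
    for update in leader.pending.get(follower.name, []):
        follower.log.append(update)
    leader.pending[follower.name] = []

sync(L, F)

# The client now reads F's local state and misses the committed write.
read_result = "x=1" in F.log
print(read_result)   # False: the read missed the write committed at L'
```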

Alex

On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
> Hi Alex,
>
> Because of the following:
>
> 1- A follower F processes operations from a client in FIFO order; say
> that a client submits, as you put it, sync + read;
> 2- A sync will be processed by the leader and returned to the follower. It 
> will be queued after all pending updates that the follower hasn't processed;
> 3- The follower will process all pending updates before processing the 
> response of the sync;
> 4- Once the follower processes the sync, it picks the read operation to 
> process. It reads the local state of the follower and returns to the client.
>
> When we process the read in Step 4, we have applied all pending updates the 
> leader had for the follower by the time the read request started.
>
> This implementation is a bit of a hack because it doesn't follow the same 
> code path as the other operations that go to the leader, but it avoids some 
> unnecessary steps, which is important for fast reads. In the sync case, the 
> other followers don't really need to know about it (there is nothing to be 
> updated) and the leader simply inserts it in the sequence of updates of F, 
> ordering it.
>
> -Flavio
>
> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>
>> Hi Flavio,
>>
>>> Starting a read operation concurrently with a sync implies that the result 
>>> of the read will not miss an update committed before the read started.
>>
>> I thought that the intention of sync was to give something like
>> linearizable reads, so if you invoke a sync and then a read, your read
>> is guaranteed to (at least) see any write which completed before the
>> sync began. Is this the intention? If so, how is this achieved
>> without running agreement on the sync op?
>>
>> Thanks,
>> Alex
>>
>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> 
>> wrote:
>>> sync simply flushes the channel between the leader and the follower that 
>>> forwarded the sync operation, so it doesn't go through the full zab 
>>> pipeline. Flushing means that all pending updates from the leader to the 
>>> follower are received by the time sync completes. Starting a read operation 
>>> concurrently with a sync implies that the result of the read will not miss 
>>> an update committed before the read started.
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>
>>>> It's strange that sync doesn't run through agreement; I was always
>>>> assuming that it did... exactly for the reason you say:
>>>> you may trust your leader, but I may have a different leader, and your
>>>> leader may not have detected it yet and still think it's the leader.
>>>>
>>>> This seems like a bug to me.
>>>>
>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>> depend on timing assumptions.
>>>> Only progress guarantees depend on time.
>>>>
>>>> Alex
>>>>
>>>>
>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> 
>>>> wrote:
>>>>> I have some pretty strong requirements in terms of consistency where
>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>> my use case.
>>>>>
>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>> off from the network.  A new leader is elected, but the follower and old
>>>>> leader don't know about it.
>>>>>
>>>>> I think sync was made for exactly this purpose, but I looked at the sync
>>>>> code, and if there aren't any outstanding proposals the leader sends the
>>>>> sync right back to the client without first verifying that it still has
>>>>> a quorum, so this won't work for my use case.
>>>>>
>>>>> At the core of the issue, all I really need is a call that will make its
>>>>> way to the leader, ping its followers, ensure it still has a
>>>>> quorum, and return success.
>>>>>
>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>> leader; the leader will ensure it still has quorum and return its epoch. I
>>>>> can use this primitive to implement all the other properties I want to
>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>> this call returns). Also the nice thing about this method is that it will
>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>> followers.
>>>>>
>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on
>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>> spite of busted clocks, horrible GC perf, VM suspends, and any other way
>>>>> time is broken.
>>>>>
>>>>> Also if people are interested I can go into more detail about what I am
>>>>> trying to write.
>>>>>
>>>>> -jc
>>>
>
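For reference, John's proposed primitive could be sketched as a toy model. getCurrentLeaderEpoch() and the names below are hypothetical, not an existing ZooKeeper API: the leader pings its followers and returns its epoch only if a quorum (counting itself) still acknowledges it, so a partitioned stale leader fails instead of answering.

```python
# Toy sketch of the proposed getCurrentLeaderEpoch() primitive.
# Hypothetical model, not ZooKeeper code.

def get_current_leader_epoch(leader_epoch, followers, cluster_size):
    """Return the leader's epoch iff a quorum still acknowledges it.

    followers: callables; each returns True if that follower is
    reachable and still follows this leader, or raises ConnectionError.
    """
    acks = 1  # the leader counts itself toward the quorum
    for ping in followers:
        try:
            if ping():
                acks += 1
        except ConnectionError:
            pass  # partitioned follower: no ack
    if acks * 2 > cluster_size:
        return leader_epoch
    raise RuntimeError("leader lost quorum; epoch %d is stale" % leader_epoch)

# 5-server ensemble: old leader (epoch 3) reaches only one follower,
# so the quorum check fails instead of returning a stale answer.
def reachable():
    return True

def partitioned():
    raise ConnectionError

try:
    get_current_leader_epoch(3, [reachable, partitioned, partitioned, partitioned], 5)
    stale_leader_answered = True
except RuntimeError:
    stale_leader_answered = False

# New leader (epoch 4) reaches its three partners: quorum holds.
new_epoch = get_current_leader_epoch(4, [reachable, reachable, reachable], 5)
print(stale_leader_answered, new_epoch)   # False 4
```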
