Another idea is to add this functionality to MultiOp: have read-only
transactions be replicated but not logged, or logged asynchronously.
I'm not sure how it works right now if I do a read-only MultiOp
transaction. Does it replicate the transaction or answer it locally
on the leader?

Alex
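
To make the "replicated but not logged" idea concrete, here is a toy Python
model of a barrier that the leader answers only after a majority of the
ensemble acknowledges its epoch, with nothing written to disk. This is a
sketch, not ZooKeeper code: Server, Leader, quorum_barrier and build_ensemble
are names made up for illustration, and the `reachable` flag stands in for a
network partition.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical ensemble member; `epoch` is the newest leader epoch it accepted."""
    name: str
    epoch: int
    reachable: bool = True

    def ack_epoch(self, leader_epoch: int) -> bool:
        # In-memory check only: ack if reachable and not following a newer leader.
        return self.reachable and self.epoch <= leader_epoch


@dataclass
class Leader:
    """Hypothetical leader that confirms its quorum before answering."""
    epoch: int
    followers: list = field(default_factory=list)

    def quorum_barrier(self) -> int:
        # Count the leader itself plus every follower that acknowledges the epoch.
        ensemble_size = len(self.followers) + 1
        acks = 1 + sum(f.ack_epoch(self.epoch) for f in self.followers)
        if acks * 2 <= ensemble_size:
            raise RuntimeError("no quorum: leadership cannot be confirmed")
        return self.epoch


def build_ensemble(partitioned: bool) -> Leader:
    # Five servers: L, F, and three others. When partitioned, the three others
    # are unreachable from L and have already moved to epoch 2.
    others = [
        Server(f"S{i}", epoch=2 if partitioned else 1, reachable=not partitioned)
        for i in range(3)
    ]
    return Leader(epoch=1, followers=[Server("F", epoch=1)] + others)
```

In the partition scenario from this thread, build_ensemble(partitioned=True)
leaves the old leader with only two acks out of five, so quorum_barrier()
raises instead of answering. A real implementation would also have to deal
with timeouts and concurrent requests, which this sketch ignores.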

On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
> Thanks for the explanation.
>
> I guess one could always invoke a write operation instead of sync to
> get the more strict semantics, but as John suggests, it might be a
> good idea to add a new type of operation that requires followers to
> ack but doesn't require them to log to disk - this seems sufficient in
> our case.
>
> Alex
>
> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
>> In theory, the scenario you're describing could happen, but I would argue 
>> that it is unlikely given that: 1) a leader pings followers twice a tick to 
>> make sure that it has a quorum of supporters (lead()); 2) followers give up 
>> on a leader upon catching an exception (followLeader()). One could calibrate 
>> tickTime to make the probability of having this scenario low.
>>
>> Let me also revisit the motivation for the way we designed sync. ZooKeeper 
>> has been designed to serve reads efficiently and making sync go through the 
>> pipeline would slow down reads. Although optional, we thought it would be a 
>> good idea to make it as efficient as possible to comply with the original 
>> expectations for the service. We consequently came up with this cheap way of 
>> making sure that a read sees all pending updates. It is correct that there 
>> are some corner cases that it doesn't cover. One is the case you mentioned. 
>> Another is having the sync finish before the client submits the read and 
>> having a write commit in between. We rely upon the way we implement 
>> timeouts and some minimum degree of synchrony for the clients when 
>> submitting operations to guarantee that the scheme works.
>>
>> We thought about the option of having the sync operation going through the 
>> pipeline, and in fact it would have been easier to implement it just as a 
>> regular write, but we opted not to because we felt it was sufficient for the 
>> use cases we had and more efficient as I already argued.
>>
>> Hope it helps to clarify.
>>
>> -Flavio
>>
>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>
>>> Thanks for the explanation! But how do you avoid the scenario
>>> raised by John?
>>> Let's say you're a client connected to F, and F is connected to L.
>>> Let's also say that L's pipeline is now empty, and both F and L are
>>> partitioned from 3 other servers in the system that have already
>>> elected a new leader L'. Now I go to L' and write something. L still
>>> thinks it's the leader because the detection that followers left it
>>> is obviously timeout dependent. So when F sends your sync to L and L
>>> returns it to F, you actually miss my write!
>>>
>>> Alex
>>>
>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> 
>>> wrote:
>>>> Hi Alex, Because of the following:
>>>>
>>>> 1- A follower F processes operations from a client in FIFO order, and say 
>>>> that a client submits as you say sync + read;
>>>> 2- A sync will be processed by the leader and returned to the follower. It 
>>>> will be queued after all pending updates that the follower hasn't 
>>>> processed;
>>>> 3- The follower will process all pending updates before processing the 
>>>> response of the sync;
>>>> 4- Once the follower processes the sync, it picks the read operation to 
>>>> process. It reads the local state of the follower and returns to the 
>>>> client.
>>>>
>>>> When we process the read in Step 4, we have applied all pending updates 
>>>> the leader had for the follower by the time the read request started.
>>>>
>>>> This implementation is a bit of a hack because it doesn't follow the same 
>>>> code path as the other operations that go to the leader, but it avoids 
>>>> some unnecessary steps, which is important for fast reads. In the sync 
>>>> case, the other followers don't really need to know about it (there is 
>>>> nothing to be updated) and the leader simply inserts it in the sequence of 
>>>> updates of F, ordering it.
>>>>
>>>> -Flavio
>>>>
>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>>> Starting a read operation concurrently with a sync implies that the 
>>>>>> result of the read will not miss an update committed before the read 
>>>>>> started.
>>>>>
>>>>> I thought that the intention of sync was to give something like
>>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>>> is guaranteed to (at least) see any write which completed before the
>>>>> sync began. Is this the intention? If so, how is this achieved
>>>>> without running agreement on the sync op ?
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> 
>>>>> wrote:
>>>>>> sync simply flushes the channel between the leader and the follower that 
>>>>>> forwarded the sync operation, so it doesn't go through the full Zab 
>>>>>> pipeline. Flushing means that all pending updates from the leader to the 
>>>>>> follower are received by the time sync completes. Starting a read 
>>>>>> operation concurrently with a sync implies that the result of the read 
>>>>>> will not miss an update committed before the read started.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>
>>>>>>> It's strange that sync doesn't run through agreement; I always
>>>>>>> assumed that it does... Exactly for the reason you say:
>>>>>>> you may trust your leader, but I may have a different leader and your
>>>>>>> leader may not detect it yet and still think it's the leader.
>>>>>>>
>>>>>>> This seems like a bug to me.
>>>>>>>
>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>>>>> depend on timing assumptions.
>>>>>>> Only progress guarantees depend on time.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> 
>>>>>>> wrote:
>>>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>>>> reading from followers that may be behind in terms of updates isn't OK
>>>>>>>> for my use case.
>>>>>>>>
>>>>>>>> One error case that worries me is if a follower and leader are
>>>>>>>> partitioned off from the network. A new leader is elected, but the
>>>>>>>> follower and old leader don't know about it.
>>>>>>>>
>>>>>>>> Normally I think sync was made for this purpose, but I looked at the
>>>>>>>> sync code and if there aren't any outstanding proposals the leader sends
>>>>>>>> the sync right back to the client without first verifying that it still
>>>>>>>> has quorum, so this won't work for my use case.
>>>>>>>>
>>>>>>>> At the core of the issue all I really need is a call that will make
>>>>>>>> its way to the leader and will ping its followers, ensure it still has a
>>>>>>>> quorum and return success.
>>>>>>>>
>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to
>>>>>>>> the leader; the leader will ensure it still has quorum and return its
>>>>>>>> epoch. I can use this primitive to implement all the other properties I
>>>>>>>> want to verify (assuming that my client will never connect to an older
>>>>>>>> epoch after this call returns). Also the nice thing about this method is
>>>>>>>> that it will not have to hit disk and the latency should just be a round
>>>>>>>> trip to the followers.
>>>>>>>>
>>>>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on
>>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>>>>> time is broken.
>>>>>>>>
>>>>>>>> Also if people are interested I can go into more detail about what I am
>>>>>>>> trying to write.
>>>>>>>>
>>>>>>>> -jc
>>>>>>
>>>>
>>
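
For readers following the thread, the four sync steps Flavio lists in the
quoted message can be sketched as a toy Python model. It assumes one FIFO
channel from the leader to each follower and treats sync as a marker queued
behind the updates already pending on that channel. ToyLeader, ToyFollower
and their methods are hypothetical names, not ZooKeeper classes, and the
model deliberately omits the quorum check discussed above.

```python
from collections import deque


class ToyFollower:
    """Hypothetical follower: applies leader messages strictly in FIFO order."""

    def __init__(self):
        self.state = {}        # local replica of the data
        self.inbox = deque()   # FIFO channel from the leader

    def sync_then_read(self, leader, key):
        leader.handle_sync(self)               # steps 1-2: forward sync to the leader
        while self.inbox:
            kind, payload = self.inbox.popleft()
            if kind == "update":               # step 3: apply pending updates first
                k, v = payload
                self.state[k] = v
            elif kind == "sync":               # step 4: sync reached, serve the read locally
                return self.state.get(key)
        raise RuntimeError("sync response never arrived")


class ToyLeader:
    """Hypothetical leader: queues messages on a follower's channel."""

    def commit(self, follower, key, value):
        # A committed update is queued for delivery; the follower may lag behind.
        follower.inbox.append(("update", (key, value)))

    def handle_sync(self, follower):
        # No agreement round: the marker is ordered after everything
        # already queued for this follower.
        follower.inbox.append(("sync", None))


leader, follower = ToyLeader(), ToyFollower()
leader.commit(follower, "/config", "v1")
leader.commit(follower, "/config", "v2")
print(follower.state.get("/config"))               # None: updates not applied yet
print(follower.sync_then_read(leader, "/config"))  # v2
```

Because handle_sync never contacts the other servers, a leader that has been
partitioned away would still answer, which is the corner case raised in the
thread.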
