I checked your message again, and the contract is that getFreshTimestamp() always returns a fresh timestamp (one not older than any previously returned by that method). One aspect of the contract is not clear to me: does this freshness property hold across clients, or only for a single client?

If it holds only per client, then you can guarantee it with master epochs: no user of your application will connect to the master of epoch y < x once it has connected to the master of epoch x. The epochs can be the sequence numbers of a sequential znode. How does that sound?
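[Editor's note: the per-client epoch check Flavio describes can be sketched locally. This is a hypothetical client-side guard, not ZooKeeper API; the epoch that would come from a sequential znode's sequence number is simulated by a plain long supplied by the caller.]

```java
// Sketch of client-side epoch fencing: a client remembers the highest
// master epoch it has seen and refuses to connect to any older master.
// In a real deployment the epoch would be the sequence number of a
// sequential znode; here it is just a long (hypothetical API).
class EpochGuard {
    private long highestEpochSeen = -1;

    /** Returns true if connecting to a master of this epoch is allowed. */
    public synchronized boolean tryConnect(long masterEpoch) {
        if (masterEpoch < highestEpochSeen) {
            return false; // stale master: we already saw a newer epoch
        }
        highestEpochSeen = masterEpoch;
        return true;
    }
}
```

With this guard, once a client has talked to the master of epoch x, any master of epoch y < x is rejected, which is exactly the per-client freshness property discussed above.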
-Flavio

On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:

> I was thinking that you can do a write per timestamp batch, not per individual timestamp. In the worst case, a former leader won't use some timestamps; I would think that is ok, but it depends on your application.
>
> Also, even if two clients believe they are leaders simultaneously and serve timestamps, the property that you seem to care about the most is uniqueness: a timestamp is not served twice. Order of timestamps would be preserved per leader, but in the case of overlapping leaders you could end up serving timestamps that do not follow an increasing order.
>
> -Flavio
>
> On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
>
>> CompareAndSwap is an atomic check and update: it only updates the value if the value equals the expected one.
>>
>> I think with your approach you'd have to do a write for every single timestamp you wanted to hand out. The latency hit from this would be too much.
>>
>> My approach is different in that a timestamp server reserves a bunch of timestamps up front and proceeds to hand them out as long as it is the leader. The leader check can hopefully be done without hitting disk.
>>
>> Thanks!
>>
>> -jc
>>
>> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <[email protected]> wrote:
>>
>>> I don't know what your compareAndSwap method does, but I was wondering whether your client process can use conditional writes to a znode to make sure that it was the last one to update the state of timestamp batches. You can treat the master election problem separately, and it does not have to be as strict as you have been thinking: it wouldn't hurt if a client still thinks it is leading even when it is not, because no two clients will be able to update the state of timestamp blocks without noticing that another client is also updating it.
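[Editor's note: the conditional-write idea can be illustrated without a live ensemble. The sketch below mimics a versioned znode: an update succeeds only if the caller's expected version matches, so an overlapping leader's write is detected rather than silently lost. The names are invented; ZooKeeper's real equivalent is setData with an expected version.]

```java
// Simulated versioned znode: setData succeeds only when the expected
// version matches, mirroring ZooKeeper's conditional write. A client
// holding a stale version learns that another client has updated the
// timestamp-batch state in the meantime.
class VersionedCell {
    private long data;
    private int version = 0;

    public synchronized int getVersion() { return version; }
    public synchronized long getData() { return data; }

    /** Returns the new version on success, or -1 if expectedVersion is stale. */
    public synchronized int setData(long newData, int expectedVersion) {
        if (expectedVersion != version) {
            return -1; // someone else updated the cell since we read it
        }
        data = newData;
        version++;
        return version;
    }
}
```

This is why the master election does not have to be strict: a client that wrongly believes it is leading simply gets -1 back and notices the other updater.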
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
>>>
>>>> So I think it's time to explain what I'm writing, just so everyone has more situational awareness. It's just a timestamp server, nothing fancy. It looks like this:
>>>>
>>>> public interface TimestampService {
>>>>     /**
>>>>      * This will get a fresh timestamp that is guaranteed to be newer
>>>>      * than any other timestamp handed out before this method was called.
>>>>      */
>>>>     long getFreshTimestamp();
>>>> }
>>>>
>>>> The only requirement is that the timestamp handed back is greater than every other timestamp that was returned before getFreshTimestamp was called. There is no ordering requirement for concurrent requests.
>>>>
>>>> My impl is to reserve blocks of timestamps that are safe to hand out (1M at a time) using compare-and-swap in ZK:
>>>>
>>>> lastPossibleUsed = read(HighWater)
>>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
>>>>
>>>> Now my leader can hand back timestamps up to safeToHandout, but before it hands one out it must ensure it is still the leader (no one else has handed back something higher). I can use ensureQuorum() and exists(myEphemNode) to make sure this is the case. Now I have a service that is guaranteed to be correct but doesn't require disk hits in the steady state, which brings down my latency (if you get close to running out, you can compareAndSwap for more timestamps).
>>>>
>>>> If many requests come in at the same time, I can use smart batching to verify happens-after for all of them at once. We can also add more layers if we need more bandwidth to scale up, at the cost of adding latency. Basically our latency will be O(lg(requestRate)) if we keep adding layers as each previous layer becomes saturated.
>>>>
>>>> I hope this explanation helps. I am busy for the next 4 hours, but if you need more clarification I can respond then.
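[Editor's note: John's reservation scheme can be sketched with an AtomicLong standing in for the HighWater znode. The real implementation would do the compare-and-swap against ZooKeeper and verify leadership before each hand-out; both are simplified away here, and all names are illustrative.]

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of block-based timestamp allocation: reserve a block of
// timestamps with one compare-and-swap on the high-water mark, then
// hand them out from memory until the block is exhausted. The shared
// AtomicLong stands in for the HighWater znode; a real leader would
// also confirm it still leads before handing each timestamp out.
class BlockTimestampService {
    private final AtomicLong highWater;   // stand-in for the ZK znode
    private final long blockSize;
    private long next;           // next timestamp to hand out
    private long safeToHandout;  // exclusive upper bound of reserved block

    BlockTimestampService(AtomicLong highWater, long blockSize) {
        this.highWater = highWater;
        this.blockSize = blockSize;
        this.next = 0;
        this.safeToHandout = 0;
    }

    public synchronized long getFreshTimestamp() {
        if (next >= safeToHandout) {
            reserveBlock(); // only here do we touch the shared high-water mark
        }
        return next++;
    }

    private void reserveBlock() {
        while (true) {
            long lastPossibleUsed = highWater.get();
            if (highWater.compareAndSet(lastPossibleUsed, lastPossibleUsed + blockSize)) {
                next = lastPossibleUsed;
                safeToHandout = lastPossibleUsed + blockSize;
                return;
            }
        }
    }
}
```

Note how the expensive CAS happens once per block, not once per timestamp, which is the latency argument made in the thread.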
>>>>
>>>> -jc
>>>>
>>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]> wrote:
>>>>
>>>>> First, thanks everyone for talking this through with me.
>>>>>
>>>>> Flavio, for your example, this is actually ok. There is a happens-after relationship between the client making the request and my leader C1 still being the leader. My service only needs to guarantee that what it hands back is at least as new as anything that existed when the client made the request. If C2 were to answer requests while C1 is stalling, that is ok, because these would be considered concurrent requests; what C2 returns may be newer, but that doesn't violate any guarantees.
>>>>>
>>>>> If some client were to get back something from C2 and then (happens-after relationship) someone tried to read from C1, that read needs to fail.
>>>>>
>>>>> To address your concern about adding too much bandwidth, we can get this cheaply by doing what Martin Thompson calls smart batching (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html):
>>>>>
>>>>> 1. An ensureQuorum request comes in to L1.
>>>>> 2. Send ENSURE to all followers.
>>>>> 3. 10 more ensureQuorum requests come in.
>>>>> 4. Get back ENSURE from a quorum.
>>>>> 5. We can now service all 10 pending ensureQuorum requests with one more round-trip ENSURE.
>>>>>
>>>>> We don't need to send an ENSURE for every ensureQuorum request; we just need an ENSURE that happens after the request arrived.
>>>>>
>>>>> I am fine with the ephemeral node being removed after some time expires, but only by the leader. If the leader's clock is broken and the client owning the ephemeral node drops off, then we don't have liveness (because this node may not get cleaned up in a timely fashion). However, we still preserve correctness.
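[Editor's note: the numbered smart-batching steps above can be sketched as a single-threaded simulation. Every request pending when a round trip starts is answered by that round trip, so the number of round trips grows with latency, not with request rate. The quorum round trip is faked by a counter; all names are invented.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of smart batching for ensureQuorum: all requests pending when
// an ENSURE round trip starts are completed by that round trip, so 10
// requests arriving during one in-flight ENSURE cost one extra round
// trip, not 10.
class SmartBatcher {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private int roundTrips = 0;

    /** A client asks the leader to confirm it still has a quorum. */
    public void submit(Runnable onEnsured) {
        pending.add(onEnsured);
    }

    /** Drain everything currently pending with a single ENSURE round trip. */
    public void drainOnce() {
        if (pending.isEmpty()) return;
        int batch = pending.size();
        roundTrips++; // one simulated ENSURE round trip to the followers
        for (int i = 0; i < batch; i++) {
            pending.poll().run(); // every request that predates the trip is ensured
        }
    }

    public int roundTrips() { return roundTrips; }
}
```

The key property is the happens-after one from the thread: a completed round trip that started after a request arrived is evidence enough for that request, so requests need not each trigger their own ENSURE.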
>>>>>
>>>>> -jc
>>>>>
>>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>
>>>>>> Say that we implement what you're suggesting. Could you check whether this scenario can happen:
>>>>>>
>>>>>> 1- Client C1 is the current leader, and it issues a super-boosted read to make sure it is still the leader;
>>>>>> 2- We process the super-boosted read, having it go through the zab pipeline;
>>>>>> 3- When we send the response to C1, we slow the whole thing down: the response to C1 gets delayed and we stall C1;
>>>>>> 4- In the meanwhile, C1's session expires on the server side and its ephemeral leadership node is removed;
>>>>>> 5- A new client C2 is elected and starts exercising leadership;
>>>>>> 6- Now C1 comes back to normal and receives the response to the super-boosted read saying that it is still the leader.
>>>>>>
>>>>>> If my interpretation is correct, the only way to prevent this scenario is for the session to expire on the client side before it receives the response to the read. It doesn't look like we can do that if process clocks can be arbitrarily delayed.
>>>>>>
>>>>>> Note that one issue is that the behavior of ephemerals is highly dependent upon timers, so I don't think we can avoid making some timing assumptions altogether. The question is whether we are better off with a mechanism relying upon acknowledgements. My sense is that application-level fencing is preferable (if not necessary) for applications like the ones JC is mentioning, or BookKeeper.
>>>>>>
>>>>>> I'm not concerned about writes to disk, which I agree we don't need for sync. I'm more concerned about having it go through the whole pipeline, which will induce more traffic on zab and increase latency for an application that uses it heavily.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>>>>>
>>>>>>> Another idea is to add this functionality to MultiOp: have read-only transactions be replicated but not logged, or logged asynchronously. I'm not sure how it works right now if I do a read-only MultiOp transaction: does it replicate the transaction, or answer it locally on the leader?
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for the explanation.
>>>>>>>>
>>>>>>>> I guess one could always invoke a write operation instead of sync to get the stricter semantics, but as John suggests, it might be a good idea to add a new type of operation that requires followers to ack but doesn't require them to log to disk. That seems sufficient in our case.
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> In theory, the scenario you're describing could happen, but I would argue that it is unlikely, given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of this scenario low.
>>>>>>>>>
>>>>>>>>> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently, and making sync go through the pipeline would slow down reads. Although sync is optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases it doesn't cover. One is the case you mentioned. Another is having the sync finish before the client submits the read, with a write committing in between. We rely upon the way we implement timeouts, and some minimum degree of synchrony for clients when submitting operations, to guarantee that the scheme works.
>>>>>>>>>
>>>>>>>>> We thought about the option of having the sync operation go through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt the cheap version was sufficient for the use cases we had and more efficient, as I already argued.
>>>>>>>>>
>>>>>>>>> Hope this helps to clarify.
>>>>>>>>>
>>>>>>>>> -Flavio
>>>>>>>>>
>>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the explanation! But how do you avoid the scenario raised by John? Let's say you're a client connected to F, and F is connected to L. Let's also say that L's pipeline is now empty, and both F and L are partitioned from 3 other servers in the system that have already elected a new leader L'. Now I go to L' and write something. L still thinks it's the leader, because the detection that followers have left it is obviously timeout dependent. So when F sends your sync to L and L returns it to F, you actually miss my write!
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Alex, because of the following:
>>>>>>>>>>>
>>>>>>>>>>> 1- A follower F processes operations from a client in FIFO order; say that a client submits, as you describe, sync + read;
>>>>>>>>>>> 2- The sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>>>>>>>>>>> 3- The follower will process all pending updates before processing the response of the sync;
>>>>>>>>>>> 4- Once the follower processes the sync, it picks up the read operation. It reads the local state of the follower and returns to the client.
>>>>>>>>>>>
>>>>>>>>>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>>>>>>>>>
>>>>>>>>>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated), and the leader simply inserts it into the sequence of updates for F, ordering it.
>>>>>>>>>>>
>>>>>>>>>>> -Flavio
>>>>>>>>>>>
>>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Flavio,
>>>>>>>>>>>>
>>>>>>>>>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
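[Editor's note: the four steps Flavio lists can be sketched as a queue simulation. The leader appends the sync marker behind the updates already queued for follower F, so F applies everything ahead of the marker before serving the read. This illustrates the flush argument only; it is not ZooKeeper code, and all names are invented.]

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of sync-as-flush: committed updates sit in a FIFO queue from
// the leader to follower F. A sync is enqueued behind them, so by the
// time F sees the sync marker it has applied every update that was
// pending when the sync reached the leader.
class SyncFlushDemo {
    static final Object SYNC = new Object();

    static final class Update {
        final String key, value;
        Update(String key, String value) { this.key = key; this.value = value; }
    }

    private final Queue<Object> channel = new ArrayDeque<>(); // leader -> F
    private final Map<String, String> followerState = new HashMap<>();

    void leaderCommit(String key, String value) { channel.add(new Update(key, value)); }
    void leaderSync() { channel.add(SYNC); } // queued after pending updates

    /** Follower: apply everything up to and including the sync marker. */
    void followerProcessUntilSync() {
        Object msg;
        while ((msg = channel.poll()) != null) {
            if (msg == SYNC) return;            // flush complete
            Update u = (Update) msg;
            followerState.put(u.key, u.value);  // apply pending update first
        }
    }

    String read(String key) { return followerState.get(key); }
}
```

Because the channel is FIFO, the read served after the sync marker cannot miss any update that was already queued for F, which is Step 4's claim. The corner case in the thread is that updates committed at a *different* leader never enter this queue at all.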
>>>>>>>>>>>>
>>>>>>>>>>>> I thought that the intention of sync was to give something like linearizable reads, so if you invoke a sync and then a read, your read is guaranteed to (at least) see any write which completed before the sync began. Is this the intention? If so, how is this achieved without running agreement on the sync op?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time the sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's strange that sync doesn't run through agreement; I was always assuming that it does... exactly for the reason you say: you may trust your leader, but I may have a different leader, and your leader may not have detected that yet and still think it's the leader.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This seems like a bug to me.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't) depend on timing assumptions. Only progress guarantees depend on time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have some pretty strong requirements in terms of consistency, where reading from followers that may be behind in terms of updates isn't ok for my use case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One error case that worries me is a follower and leader being partitioned off from the network. A new leader is elected, but the follower and old leader don't know about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Normally I think sync was made for this purpose, but I looked at the sync code, and if there aren't any outstanding proposals, the leader sends the sync right back to the client without first verifying that it still has quorum, so this won't work for my use case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At the core of the issue, all I really need is a call that will make its way to the leader, ping its followers, ensure it still has a quorum, and return success.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically, a getCurrentLeaderEpoch() method that will be forwarded to the leader; the leader will ensure it still has quorum and return its epoch. I can use this primitive to implement all the other properties I want to verify (assuming that my client will never connect to an older epoch after this call returns). Also, the nice thing about this method is that it will not have to hit disk, and the latency should just be a round trip to the followers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on clocks and expiring timers, but I'm hoping to offer some guarantees in spite of busted clocks, horrible GC performance, VM suspends, and any other way time is broken.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, if people are interested, I can go into more detail about what I am trying to write.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -jc
