This seems like a good compromise.  We still have to eat the latency of a
write, but we easily achieve smart batching in this case so many
outstanding sync can all be serviced by the same lastPending request.

-jc


On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <[email protected]> wrote:

> there is a very easy solution to this. we only rely on clocks in the
> case that there are no pending transactions. (if there are pending
> transactions, the sync will only return if in fact the leader is still
> the leader, otherwise the transaction that the sync is waiting on will
> never commit and the sync will never return.)
>
> so, if there aren't any transactions, just submit one. make it a bogus
> one: create / for example. then queue the sync behind it.
>
> ben
>
> ps - we bring up this issue and the solution and the rational for the
> current implementation in section 4.4 of the zookeeper usenix paper.
>
> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <[email protected]>
> wrote:
> > So I think it's time to explain what I'm writing just so everyone has
> more
> > situation awareness.  Its just a timestamp server, nothing fancy.
> >
> > Looks like this:
> >
> > public interface TimestampService {
> >     /**
> >      * This will get a fresh timestamp that is guarenteed to be newer
> than
> > any other timestamp
> >      * handed out before this method was called.
> >      */
> >     long getFreshTimestamp();
> > }
> >
> > The only requirement is that the timestamp handed back is greater than
> every
> > other timestamp that was returned before getFreshTs was called.  There
> is no
> > ordering requirement for concurrent requests.
> >
> > My impl is to reserve blocks of timestamps that are safe to hand out (1M
> at
> > a time) using compare and swap in ZK.
> > lastPossibleUsed = read(HighWater)
> > safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
> >
> > Now my leader can hand back timestamps up to safeToHandout, but before it
> > hands one out it must ensure it is still the leader (no one else has
> handed
> > back something higher).
> > I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
> case.
> > Now I have a service that is guarenteed to be correct, but doesn't
> require
> > disk hits in the steady state which brings down my latency (if you get
> close
> > to running out, you can compareAndSwap for more timestamps).
> >
> > If many requests come in at the same time I can use smart batching to
> verify
> > happens after for all at once.  We can also add more layers if we need
> more
> > bandwidth to scale up at the cost of adding latency.  Basically our
> latency
> > will be O(lg(requestRate)) if we keep adding layers as each previous
> layer
> > becomes saturated.
> >
> > I hope this explanation helps. I am busy for the next 4 hours, but if you
> > need more clarification I can respond to them at that time.
> >
> > -jc
> >
> >
> > On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]>
> > wrote:
> >>
> >> First, thanks everyone for talking this through with me.
> >>
> >> Flavio, for your example, this is actually ok.  There is a happens after
> >> relationship between the client making the request and my leader C1
> still
> >> being the leader.  My service only needs to guarantee that what it hands
> >> back is at least as new as anything that existed when the client made
> the
> >> request.  If C2 were to answer requests while C1 is stalling that is ok
> >> because these would be considered concurrent requests and the stuff
> returned
> >> by C2 may be newer but that doesn't violate any guarentees.
> >>
> >> If some client were to get back something from C2 and then (happens
> after
> >> relationship) someone tried to read from C1, it needs to fail.
> >>
> >> To address your concern of adding too much bandwidth we can get this
> >> easily by doing what Martin Thompson calls smart batching
> >> (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
> >>
> >> 1. ensureQuorum request comes in to L1
> >> 2. send ENSURE to all followers
> >> 3. 10 more ensureQuorum requests come in
> >> 4. get back ENSURE from quorum
> >> 5. we can now service all 10 pending ensureQuorum requests with another
> >> round trip ENSURE.
> >>
> >> We don't need to send an ENSURE for every ensureQuorum request, we just
> >> need it to be happens after from when the request arrived.
> >>
> >> I am fine with the Ephemeral node being removed after some time expires,
> >> but only by the leader.  If the leaders clock is broken and the client
> >> owning the Ephemeral node drops off, then we don't have liveness
> (because
> >> this node may not get cleaned up in a timely fashion).  However, we
> still
> >> preserve corectness.
> >>
> >> -jc
> >>
> >>
> >> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]>
> >> wrote:
> >>>
> >>> Say that we implement what you're suggesting. Could you check if this
> >>> scenario can happen:
> >>>
> >>> 1- Client C1 is the current leader and it super boosted read to make
> sure
> >>> it is still the leader;
> >>> 2- We process the super boosted read having it through the zab
> pipeline;
> >>> 3- When we send the response to C1 we slow down the whole deal: the
> >>> response to C1 gets delayed and we stall C1;
> >>> 4- In the meanwhile, C1's session expires on the server side and its
> >>> ephemeral leadership node is removed;
> >>> 5- A new client C2 is elected and starts exercising leadership;
> >>> 6- Now C1 comes back to normal and receives the response of the super
> >>> boosted read saying that it is still the leader.
> >>>
> >>> If my interpretation is not incorrect, the only way to prevent this
> >>> scenario from happening is if the session expires on the client side
> before
> >>> it receives the response of the read. It doesn't look like we can do
> it if
> >>> process clocks can be arbitrarily delayed.
> >>>
> >>> Note that one issue is that the behavior of ephemerals is highly
> >>> dependent upon timers, so I don't think we can avoid making some timing
> >>> assumptions altogether. The question is if we are better off with a
> >>> mechanism relying upon acknowledgements. My sense is that
> application-level
> >>> fencing is preferable (if not necessary) for applications like the
> ones JC
> >>> is mentioning or BookKeeper.
> >>>
> >>> I'm not concerned about writes to disk, which I agree we don't need for
> >>> sync. I'm more concerned about having it going through the whole
> pipeline,
> >>> which will induce more traffic to zab and increase latency for an
> >>> application that uses it heavily.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>
> >>> > another idea is to add this functionality to MultiOp - have read only
> >>> > transactions be replicated but not logged or logged asynchronously.
> >>> > I'm not sure how it works right now if I do a read-only MultiOp
> >>> > transaction - does it replicate the transaction or answer it locally
> >>> > on the leader ?
> >>> >
> >>> > Alex
> >>> >
> >>> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]
> >
> >>> > wrote:
> >>> >> Thanks for the explanation.
> >>> >>
> >>> >> I guess one could always invoke a write operation instead of sync to
> >>> >> get the more strict semantics, but as John suggests, it might be a
> >>> >> good idea to add a new type of operation that requires followers to
> >>> >> ack but doesn't require them to log to disk - this seems sufficient
> in
> >>> >> our case.
> >>> >>
> >>> >> Alex
> >>> >>
> >>> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <
> [email protected]>
> >>> >> wrote:
> >>> >>> In theory, the scenario you're describing could happen, but I would
> >>> >>> argue that it is unlikely given that: 1) a leader pings followers
> twice a
> >>> >>> tick to make sure that it has a quorum of supporters (lead()); 2)
> followers
> >>> >>> give up on a leader upon catching an exception (followLeader()).
> One could
> >>> >>> calibrate tickTime to make the probability of having this scenario
> low.
> >>> >>>
> >>> >>> Let me also revisit the motivation for the way we designed sync.
> >>> >>> ZooKeeper has been designed to serve reads efficiently and making
> sync go
> >>> >>> through the pipeline would slow down reads. Although optional, we
> thought it
> >>> >>> would be a good idea to make it as efficient as possible to comply
> with the
> >>> >>> original expectations for the service. We consequently came up
> with this
> >>> >>> cheap way of making sure that a read sees all pending updates. It
> is correct
> >>> >>> that there are some corner cases that it doesn't cover. One is the
> case you
> >>> >>> mentioned. Another is having the sync finishing before the client
> submits
> >>> >>> the read and having a write committing in between. We rely upon
> the way we
> >>> >>> implement timeouts and some minimum degree of synchrony for the
> clients when
> >>> >>> submitting operations to guarantee that the scheme work.
> >>> >>>
> >>> >>> We thought about the option of having the sync operation going
> >>> >>> through the pipeline, and in fact it would have been easier to
> implement it
> >>> >>> just as a regular write, but we opted not to because we felt it was
> >>> >>> sufficient for the use cases we had and more efficient as I
> already argued.
> >>> >>>
> >>> >>> Hope it helps to clarify.
> >>> >>>
> >>> >>> -Flavio
> >>> >>>
> >>> >>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>> >>>
> >>> >>>> thanks for the explanation! but how do you avoid having the
> scenario
> >>> >>>> raised by John ?
> >>> >>>> lets say you're a client connected to F, and F is connected to L.
> >>> >>>> Lets
> >>> >>>> also say that L's pipeline
> >>> >>>> is now empty, and both F and L are partitioned from 3 other
> servers
> >>> >>>> in
> >>> >>>> the system that have already
> >>> >>>> elected a new leader L'. Now I go to L' and write something. L
> still
> >>> >>>> thinks its the leader because the
> >>> >>>> detection that followers left it is obviously timeout dependent.
> So
> >>> >>>> when F sends your sync to L and L returns
> >>> >>>> it to F, you actually miss my write!
> >>> >>>>
> >>> >>>> Alex
> >>> >>>>
> >>> >>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira
> >>> >>>> <[email protected]> wrote:
> >>> >>>>> Hi Alex, Because of the following:
> >>> >>>>>
> >>> >>>>> 1- A follower F processes operations from a client in FIFO order,
> >>> >>>>> and say that a client submits as you say sync + read;
> >>> >>>>> 2- A sync will be processed by the leader and returned to the
> >>> >>>>> follower. It will be queued after all pending updates that the
> follower
> >>> >>>>> hasn't processed;
> >>> >>>>> 3- The follower will process all pending updates before
> processing
> >>> >>>>> the response of the sync;
> >>> >>>>> 4- Once the follower processes the sync, it picks the read
> >>> >>>>> operation to process. It reads the local state of the follower
> and returns
> >>> >>>>> to the client.
> >>> >>>>>
> >>> >>>>> When we process the read in Step 4, we have applied all pending
> >>> >>>>> updates the leader had for the follower by the time the read
> request
> >>> >>>>> started.
> >>> >>>>>
> >>> >>>>> This implementation is a bit of a hack because it doesn't follow
> >>> >>>>> the same code path as the other operations that go to the
> leader, but it
> >>> >>>>> avoids some unnecessary steps, which is important for fast
> reads. In the
> >>> >>>>> sync case, the other followers don't really need to know about
> it (there is
> >>> >>>>> nothing to be updated) and the leader simply inserts it in the
> sequence of
> >>> >>>>> updates of F, ordering it.
> >>> >>>>>
> >>> >>>>> -Flavio
> >>> >>>>>
> >>> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>> >>>>>
> >>> >>>>>> Hi Flavio,
> >>> >>>>>>
> >>> >>>>>>> Starting a read operation concurrently with a sync implies that
> >>> >>>>>>> the result of the read will not miss an update committed
> before the read
> >>> >>>>>>> started.
> >>> >>>>>>
> >>> >>>>>> I thought that the intention of sync was to give something like
> >>> >>>>>> linearizable reads, so if you invoke a sync and then a read,
> your
> >>> >>>>>> read
> >>> >>>>>> is guaranteed to (at least) see any write which completed before
> >>> >>>>>> the
> >>> >>>>>> sync began. Is this the intention ? If so, how is this achieved
> >>> >>>>>> without running agreement on the sync op ?
> >>> >>>>>>
> >>> >>>>>> Thanks,
> >>> >>>>>> Alex
> >>> >>>>>>
> >>> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira
> >>> >>>>>> <[email protected]> wrote:
> >>> >>>>>>> sync simply flushes the channel between the leader and the
> >>> >>>>>>> follower that forwarded the sync operation, so it doesn't go
> through the
> >>> >>>>>>> full zab pipeline. Flushing means that all pending updates
> from the leader
> >>> >>>>>>> to the follower are received by the time sync completes.
> Starting a read
> >>> >>>>>>> operation concurrently with a sync implies that the result of
> the read will
> >>> >>>>>>> not miss an update committed before the read started.
> >>> >>>>>>>
> >>> >>>>>>> -Flavio
> >>> >>>>>>>
> >>> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>> >>>>>>>
> >>> >>>>>>>> Its strange that sync doesn't run through agreement, I was
> >>> >>>>>>>> always
> >>> >>>>>>>> assuming that it is... Exactly for the reason you say -
> >>> >>>>>>>> you may trust your leader, but I may have a different leader
> and
> >>> >>>>>>>> your
> >>> >>>>>>>> leader may not detect it yet and still think its the leader.
> >>> >>>>>>>>
> >>> >>>>>>>> This seems like a bug to me.
> >>> >>>>>>>>
> >>> >>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or
> >>> >>>>>>>> shouldn't)
> >>> >>>>>>>> depend on timing assumption.
> >>> >>>>>>>> Only progress guarantees depend on time.
> >>> >>>>>>>>
> >>> >>>>>>>> Alex
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino
> >>> >>>>>>>> <[email protected]> wrote:
> >>> >>>>>>>>> I have some pretty strong requirements in terms of
> consistency
> >>> >>>>>>>>> where
> >>> >>>>>>>>> reading from followers that may be behind in terms of updates
> >>> >>>>>>>>> isn't ok for
> >>> >>>>>>>>> my use case.
> >>> >>>>>>>>>
> >>> >>>>>>>>> One error case that worries me is if a follower and leader
> are
> >>> >>>>>>>>> partitioned
> >>> >>>>>>>>> off from the network.  A new leader is elected, but the
> >>> >>>>>>>>> follower and old
> >>> >>>>>>>>> leader don't know about it.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Normally I think sync was made for this purpost, but I looked
> >>> >>>>>>>>> at the sync
> >>> >>>>>>>>> code and if there aren't any outstanding proposals the leader
> >>> >>>>>>>>> sends the
> >>> >>>>>>>>> sync right back to the client without first verifying that it
> >>> >>>>>>>>> still has
> >>> >>>>>>>>> quorum, so this won't work for my use case.
> >>> >>>>>>>>>
> >>> >>>>>>>>> At the core of the issue all I really need is a call that
> will
> >>> >>>>>>>>> make it's
> >>> >>>>>>>>> way to the leader and will ping it's followers, ensure it
> still
> >>> >>>>>>>>> has a
> >>> >>>>>>>>> quorum and return success.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> >>> >>>>>>>>> forwarded to the
> >>> >>>>>>>>> leader, leader will ensure it still has quorum and return
> it's
> >>> >>>>>>>>> epoch.  I
> >>> >>>>>>>>> can use this primitive to implement all the other properties
> I
> >>> >>>>>>>>> want to
> >>> >>>>>>>>> verify (assuming that my client will never connect to an
> older
> >>> >>>>>>>>> epoch after
> >>> >>>>>>>>> this call returns). Also the nice thing about this method is
> >>> >>>>>>>>> that it will
> >>> >>>>>>>>> not have to hit disk and the latency should just be a round
> >>> >>>>>>>>> trip to the
> >>> >>>>>>>>> followers.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Most of the guarentees offered by zookeeper are time based an
> >>> >>>>>>>>> rely on
> >>> >>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
> >>> >>>>>>>>> guarantees in
> >>> >>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any
> >>> >>>>>>>>> other way
> >>> >>>>>>>>> time is broken.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Also if people are interested I can go into more detail about
> >>> >>>>>>>>> what I am
> >>> >>>>>>>>> trying to write.
> >>> >>>>>>>>>
> >>> >>>>>>>>> -jc
> >>> >>>>>>>
> >>> >>>>>
> >>> >>>
> >>>
> >>
> >
>

Reply via email to