Hi Alex, Hongchao, I agree with the assertion that we should favor
correctness here.

Patrick

On Mon, Mar 9, 2015 at 9:10 PM, Alexander Shraer <[email protected]> wrote:
> Bumping this thread as it was brought up again by Hongchao in the context of
> failing tests.
> I agree with him that its much better to make sync a quorum operation,
> however this may come in the expense
> of slightly slower strong reads. I think this is acceptable since if someone
> does "sync + read" he cares more about semantics
> than latency.
>
> On Fri, Mar 1, 2013 at 7:06 PM, Alexander Shraer <[email protected]> wrote:
>>
>> This is an old thread (below), but it doesn't seem like any conclusion
>> was reached on what we want to do to address the issue.
>>
>> Reminder of the problem: sync only gets you strong semantics if there
>> is no leader change. If there is a leader change,
>> then these semantics are guaranteed only if we make some timing
>> assumptions, not made elsewhere in ZooKeeper. It would be much
>> better not to make timing assumptions for such safety/consistency
>> properties, only for liveness.
>>
>> The problem happens when your leader is no longer the leader but
>> doesn't know it yet. He responds to a sync, but that doesn't mean
>> you follower sees all committed state. Some other server may have
>> already become the leader and committed some updates, which the sync
>> won't flush to
>> your follower, which is still connected to the old leader.
>>
>> To prevent this we should broadcast the sync like updates, or
>> piggyback them on other ops, or perhaps create a new type of sync that
>> is broadcasted.
>>
>> As Ben pointed out, this problem is also mentioned in Section 4.4 of
>> the ZooKeeper peper (but the proposed solution there is insufficient
>> to solve the issue, as
>> discussed below).
>>
>> Alex
>>
>>
>> On Fri, Sep 28, 2012 at 4:45 PM, John Carrino <[email protected]>
>> wrote:
>> > Ben, after thinking about this more. I don't think this solution gets
>> > the
>> > property that I need.  Just because there are outstanding proposals that
>> > are
>> > committed later doesn't imply we are still the leader.  It only means
>> > that
>> > when the new leader does recovery it will also see these proposals as
>> > committed.
>> >
>> > Let's say we have a 5 node cluster and L1 has one pending request out.
>> > F2-F5
>> > are followers. We get back an ack from F2.  Now F5 and L1 are
>> > partitioned
>> > off from the network along with client C1.
>> >
>> > Recovery happens on F2-F4 and F2 becomes L2.  During recovery this
>> > proposal
>> > is accepted because F2 had acked it.  Now L2 does a bunch of stuff
>> > including
>> > deleting your ephemeral node.
>> >
>> > Now a sync comes in from C1 through F5. Now L1 finally gets that ack
>> > from F5
>> > and goes ahead and commits it and responds to the outstanding sync
>> > request
>> > to C1.
>> >
>> > We can see with this ordering there isn't a happens after relationship
>> > between the sync request and knowing about all commits that occurred
>> > before
>> > the sync request.
>> >
>> > Yes, I realize that this ordering is unlikely to happen in practice, but
>> > I
>> > hate trusting time for anything.
>> >
>> > -jc
>> >
>> > On Fri, Sep 28, 2012 at 7:31 AM, John Carrino <[email protected]>
>> > wrote:
>> >>
>> >> This seems like a good compromise.  We still have to eat the latency of
>> >> a
>> >> write, but we easily achieve smart batching in this case so many
>> >> outstanding
>> >> sync can all be serviced by the same lastPending request.
>> >>
>> >> -jc
>> >>
>> >>
>> >> On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <[email protected]>
>> >> wrote:
>> >>>
>> >>> there is a very easy solution to this. we only rely on clocks in the
>> >>> case that there are no pending transactions. (if there are pending
>> >>> transactions, the sync will only return if in fact the leader is still
>> >>> the leader, otherwise the transaction that the sync is waiting on will
>> >>> never commit and the sync will never return.)
>> >>>
>> >>> so, if there aren't any transactions, just submit one. make it a bogus
>> >>> one: create / for example. then queue the sync behind it.
>> >>>
>> >>> ben
>> >>>
>> >>> ps - we bring up this issue and the solution and the rational for the
>> >>> current implementation in section 4.4 of the zookeeper usenix paper.
>> >>>
>> >>> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <[email protected]>
>> >>> wrote:
>> >>> > So I think it's time to explain what I'm writing just so everyone
>> >>> > has
>> >>> > more
>> >>> > situation awareness.  Its just a timestamp server, nothing fancy.
>> >>> >
>> >>> > Looks like this:
>> >>> >
>> >>> > public interface TimestampService {
>> >>> >     /**
>> >>> >      * This will get a fresh timestamp that is guarenteed to be
>> >>> > newer
>> >>> > than
>> >>> > any other timestamp
>> >>> >      * handed out before this method was called.
>> >>> >      */
>> >>> >     long getFreshTimestamp();
>> >>> > }
>> >>> >
>> >>> > The only requirement is that the timestamp handed back is greater
>> >>> > than
>> >>> > every
>> >>> > other timestamp that was returned before getFreshTs was called.
>> >>> > There
>> >>> > is no
>> >>> > ordering requirement for concurrent requests.
>> >>> >
>> >>> > My impl is to reserve blocks of timestamps that are safe to hand out
>> >>> > (1M at
>> >>> > a time) using compare and swap in ZK.
>> >>> > lastPossibleUsed = read(HighWater)
>> >>> > safeToHandout = compareAndSwap(lastPossibleUsed,
>> >>> > lastPossibleUsed+1M)
>> >>> >
>> >>> > Now my leader can hand back timestamps up to safeToHandout, but
>> >>> > before
>> >>> > it
>> >>> > hands one out it must ensure it is still the leader (no one else has
>> >>> > handed
>> >>> > back something higher).
>> >>> > I can use ensureQuorum(), exists(myEphemNode) to make sure this is
>> >>> > the
>> >>> > case.
>> >>> > Now I have a service that is guarenteed to be correct, but doesn't
>> >>> > require
>> >>> > disk hits in the steady state which brings down my latency (if you
>> >>> > get
>> >>> > close
>> >>> > to running out, you can compareAndSwap for more timestamps).
>> >>> >
>> >>> > If many requests come in at the same time I can use smart batching
>> >>> > to
>> >>> > verify
>> >>> > happens after for all at once.  We can also add more layers if we
>> >>> > need
>> >>> > more
>> >>> > bandwidth to scale up at the cost of adding latency.  Basically our
>> >>> > latency
>> >>> > will be O(lg(requestRate)) if we keep adding layers as each previous
>> >>> > layer
>> >>> > becomes saturated.
>> >>> >
>> >>> > I hope this explanation helps. I am busy for the next 4 hours, but
>> >>> > if
>> >>> > you
>> >>> > need more clarification I can respond to them at that time.
>> >>> >
>> >>> > -jc
>> >>> >
>> >>> >
>> >>> > On Thu, Sep 27, 2012 at 9:26 AM, John Carrino
>> >>> > <[email protected]>
>> >>> > wrote:
>> >>> >>
>> >>> >> First, thanks everyone for talking this through with me.
>> >>> >>
>> >>> >> Flavio, for your example, this is actually ok.  There is a happens
>> >>> >> after
>> >>> >> relationship between the client making the request and my leader C1
>> >>> >> still
>> >>> >> being the leader.  My service only needs to guarantee that what it
>> >>> >> hands
>> >>> >> back is at least as new as anything that existed when the client
>> >>> >> made
>> >>> >> the
>> >>> >> request.  If C2 were to answer requests while C1 is stalling that
>> >>> >> is
>> >>> >> ok
>> >>> >> because these would be considered concurrent requests and the stuff
>> >>> >> returned
>> >>> >> by C2 may be newer but that doesn't violate any guarentees.
>> >>> >>
>> >>> >> If some client were to get back something from C2 and then (happens
>> >>> >> after
>> >>> >> relationship) someone tried to read from C1, it needs to fail.
>> >>> >>
>> >>> >> To address your concern of adding too much bandwidth we can get
>> >>> >> this
>> >>> >> easily by doing what Martin Thompson calls smart batching
>> >>> >>
>> >>> >> (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
>> >>> >>
>> >>> >> 1. ensureQuorum request comes in to L1
>> >>> >> 2. send ENSURE to all followers
>> >>> >> 3. 10 more ensureQuorum requests come in
>> >>> >> 4. get back ENSURE from quorum
>> >>> >> 5. we can now service all 10 pending ensureQuorum requests with
>> >>> >> another
>> >>> >> round trip ENSURE.
>> >>> >>
>> >>> >> We don't need to send an ENSURE for every ensureQuorum request, we
>> >>> >> just
>> >>> >> need it to be happens after from when the request arrived.
>> >>> >>
>> >>> >> I am fine with the Ephemeral node being removed after some time
>> >>> >> expires,
>> >>> >> but only by the leader.  If the leaders clock is broken and the
>> >>> >> client
>> >>> >> owning the Ephemeral node drops off, then we don't have liveness
>> >>> >> (because
>> >>> >> this node may not get cleaned up in a timely fashion).  However, we
>> >>> >> still
>> >>> >> preserve corectness.
>> >>> >>
>> >>> >> -jc
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira
>> >>> >> <[email protected]>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Say that we implement what you're suggesting. Could you check if
>> >>> >>> this
>> >>> >>> scenario can happen:
>> >>> >>>
>> >>> >>> 1- Client C1 is the current leader and it super boosted read to
>> >>> >>> make
>> >>> >>> sure
>> >>> >>> it is still the leader;
>> >>> >>> 2- We process the super boosted read having it through the zab
>> >>> >>> pipeline;
>> >>> >>> 3- When we send the response to C1 we slow down the whole deal:
>> >>> >>> the
>> >>> >>> response to C1 gets delayed and we stall C1;
>> >>> >>> 4- In the meanwhile, C1's session expires on the server side and
>> >>> >>> its
>> >>> >>> ephemeral leadership node is removed;
>> >>> >>> 5- A new client C2 is elected and starts exercising leadership;
>> >>> >>> 6- Now C1 comes back to normal and receives the response of the
>> >>> >>> super
>> >>> >>> boosted read saying that it is still the leader.
>> >>> >>>
>> >>> >>> If my interpretation is not incorrect, the only way to prevent
>> >>> >>> this
>> >>> >>> scenario from happening is if the session expires on the client
>> >>> >>> side
>> >>> >>> before
>> >>> >>> it receives the response of the read. It doesn't look like we can
>> >>> >>> do
>> >>> >>> it if
>> >>> >>> process clocks can be arbitrarily delayed.
>> >>> >>>
>> >>> >>> Note that one issue is that the behavior of ephemerals is highly
>> >>> >>> dependent upon timers, so I don't think we can avoid making some
>> >>> >>> timing
>> >>> >>> assumptions altogether. The question is if we are better off with
>> >>> >>> a
>> >>> >>> mechanism relying upon acknowledgements. My sense is that
>> >>> >>> application-level
>> >>> >>> fencing is preferable (if not necessary) for applications like the
>> >>> >>> ones JC
>> >>> >>> is mentioning or BookKeeper.
>> >>> >>>
>> >>> >>> I'm not concerned about writes to disk, which I agree we don't
>> >>> >>> need
>> >>> >>> for
>> >>> >>> sync. I'm more concerned about having it going through the whole
>> >>> >>> pipeline,
>> >>> >>> which will induce more traffic to zab and increase latency for an
>> >>> >>> application that uses it heavily.
>> >>> >>>
>> >>> >>> -Flavio
>> >>> >>>
>> >>> >>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>> >>> >>>
>> >>> >>> > another idea is to add this functionality to MultiOp - have read
>> >>> >>> > only
>> >>> >>> > transactions be replicated but not logged or logged
>> >>> >>> > asynchronously.
>> >>> >>> > I'm not sure how it works right now if I do a read-only MultiOp
>> >>> >>> > transaction - does it replicate the transaction or answer it
>> >>> >>> > locally
>> >>> >>> > on the leader ?
>> >>> >>> >
>> >>> >>> > Alex
>> >>> >>> >
>> >>> >>> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer
>> >>> >>> > <[email protected]>
>> >>> >>> > wrote:
>> >>> >>> >> Thanks for the explanation.
>> >>> >>> >>
>> >>> >>> >> I guess one could always invoke a write operation instead of
>> >>> >>> >> sync
>> >>> >>> >> to
>> >>> >>> >> get the more strict semantics, but as John suggests, it might
>> >>> >>> >> be a
>> >>> >>> >> good idea to add a new type of operation that requires
>> >>> >>> >> followers
>> >>> >>> >> to
>> >>> >>> >> ack but doesn't require them to log to disk - this seems
>> >>> >>> >> sufficient in
>> >>> >>> >> our case.
>> >>> >>> >>
>> >>> >>> >> Alex
>> >>> >>> >>
>> >>> >>> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira
>> >>> >>> >> <[email protected]>
>> >>> >>> >> wrote:
>> >>> >>> >>> In theory, the scenario you're describing could happen, but I
>> >>> >>> >>> would
>> >>> >>> >>> argue that it is unlikely given that: 1) a leader pings
>> >>> >>> >>> followers
>> >>> >>> >>> twice a
>> >>> >>> >>> tick to make sure that it has a quorum of supporters (lead());
>> >>> >>> >>> 2)
>> >>> >>> >>> followers
>> >>> >>> >>> give up on a leader upon catching an exception
>> >>> >>> >>> (followLeader()).
>> >>> >>> >>> One could
>> >>> >>> >>> calibrate tickTime to make the probability of having this
>> >>> >>> >>> scenario low.
>> >>> >>> >>>
>> >>> >>> >>> Let me also revisit the motivation for the way we designed
>> >>> >>> >>> sync.
>> >>> >>> >>> ZooKeeper has been designed to serve reads efficiently and
>> >>> >>> >>> making
>> >>> >>> >>> sync go
>> >>> >>> >>> through the pipeline would slow down reads. Although optional,
>> >>> >>> >>> we
>> >>> >>> >>> thought it
>> >>> >>> >>> would be a good idea to make it as efficient as possible to
>> >>> >>> >>> comply with the
>> >>> >>> >>> original expectations for the service. We consequently came up
>> >>> >>> >>> with this
>> >>> >>> >>> cheap way of making sure that a read sees all pending updates.
>> >>> >>> >>> It
>> >>> >>> >>> is correct
>> >>> >>> >>> that there are some corner cases that it doesn't cover. One is
>> >>> >>> >>> the case you
>> >>> >>> >>> mentioned. Another is having the sync finishing before the
>> >>> >>> >>> client
>> >>> >>> >>> submits
>> >>> >>> >>> the read and having a write committing in between. We rely
>> >>> >>> >>> upon
>> >>> >>> >>> the way we
>> >>> >>> >>> implement timeouts and some minimum degree of synchrony for
>> >>> >>> >>> the
>> >>> >>> >>> clients when
>> >>> >>> >>> submitting operations to guarantee that the scheme work.
>> >>> >>> >>>
>> >>> >>> >>> We thought about the option of having the sync operation going
>> >>> >>> >>> through the pipeline, and in fact it would have been easier to
>> >>> >>> >>> implement it
>> >>> >>> >>> just as a regular write, but we opted not to because we felt
>> >>> >>> >>> it
>> >>> >>> >>> was
>> >>> >>> >>> sufficient for the use cases we had and more efficient as I
>> >>> >>> >>> already argued.
>> >>> >>> >>>
>> >>> >>> >>> Hope it helps to clarify.
>> >>> >>> >>>
>> >>> >>> >>> -Flavio
>> >>> >>> >>>
>> >>> >>> >>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>> >>> >>> >>>
>> >>> >>> >>>> thanks for the explanation! but how do you avoid having the
>> >>> >>> >>>> scenario
>> >>> >>> >>>> raised by John ?
>> >>> >>> >>>> lets say you're a client connected to F, and F is connected
>> >>> >>> >>>> to
>> >>> >>> >>>> L.
>> >>> >>> >>>> Lets
>> >>> >>> >>>> also say that L's pipeline
>> >>> >>> >>>> is now empty, and both F and L are partitioned from 3 other
>> >>> >>> >>>> servers
>> >>> >>> >>>> in
>> >>> >>> >>>> the system that have already
>> >>> >>> >>>> elected a new leader L'. Now I go to L' and write something.
>> >>> >>> >>>> L
>> >>> >>> >>>> still
>> >>> >>> >>>> thinks its the leader because the
>> >>> >>> >>>> detection that followers left it is obviously timeout
>> >>> >>> >>>> dependent.
>> >>> >>> >>>> So
>> >>> >>> >>>> when F sends your sync to L and L returns
>> >>> >>> >>>> it to F, you actually miss my write!
>> >>> >>> >>>>
>> >>> >>> >>>> Alex
>> >>> >>> >>>>
>> >>> >>> >>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira
>> >>> >>> >>>> <[email protected]> wrote:
>> >>> >>> >>>>> Hi Alex, Because of the following:
>> >>> >>> >>>>>
>> >>> >>> >>>>> 1- A follower F processes operations from a client in FIFO
>> >>> >>> >>>>> order,
>> >>> >>> >>>>> and say that a client submits as you say sync + read;
>> >>> >>> >>>>> 2- A sync will be processed by the leader and returned to
>> >>> >>> >>>>> the
>> >>> >>> >>>>> follower. It will be queued after all pending updates that
>> >>> >>> >>>>> the
>> >>> >>> >>>>> follower
>> >>> >>> >>>>> hasn't processed;
>> >>> >>> >>>>> 3- The follower will process all pending updates before
>> >>> >>> >>>>> processing
>> >>> >>> >>>>> the response of the sync;
>> >>> >>> >>>>> 4- Once the follower processes the sync, it picks the read
>> >>> >>> >>>>> operation to process. It reads the local state of the
>> >>> >>> >>>>> follower
>> >>> >>> >>>>> and returns
>> >>> >>> >>>>> to the client.
>> >>> >>> >>>>>
>> >>> >>> >>>>> When we process the read in Step 4, we have applied all
>> >>> >>> >>>>> pending
>> >>> >>> >>>>> updates the leader had for the follower by the time the read
>> >>> >>> >>>>> request
>> >>> >>> >>>>> started.
>> >>> >>> >>>>>
>> >>> >>> >>>>> This implementation is a bit of a hack because it doesn't
>> >>> >>> >>>>> follow
>> >>> >>> >>>>> the same code path as the other operations that go to the
>> >>> >>> >>>>> leader, but it
>> >>> >>> >>>>> avoids some unnecessary steps, which is important for fast
>> >>> >>> >>>>> reads. In the
>> >>> >>> >>>>> sync case, the other followers don't really need to know
>> >>> >>> >>>>> about
>> >>> >>> >>>>> it (there is
>> >>> >>> >>>>> nothing to be updated) and the leader simply inserts it in
>> >>> >>> >>>>> the
>> >>> >>> >>>>> sequence of
>> >>> >>> >>>>> updates of F, ordering it.
>> >>> >>> >>>>>
>> >>> >>> >>>>> -Flavio
>> >>> >>> >>>>>
>> >>> >>> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>> >>> >>> >>>>>
>> >>> >>> >>>>>> Hi Flavio,
>> >>> >>> >>>>>>
>> >>> >>> >>>>>>> Starting a read operation concurrently with a sync implies
>> >>> >>> >>>>>>> that
>> >>> >>> >>>>>>> the result of the read will not miss an update committed
>> >>> >>> >>>>>>> before the read
>> >>> >>> >>>>>>> started.
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> I thought that the intention of sync was to give something
>> >>> >>> >>>>>> like
>> >>> >>> >>>>>> linearizable reads, so if you invoke a sync and then a
>> >>> >>> >>>>>> read,
>> >>> >>> >>>>>> your
>> >>> >>> >>>>>> read
>> >>> >>> >>>>>> is guaranteed to (at least) see any write which completed
>> >>> >>> >>>>>> before
>> >>> >>> >>>>>> the
>> >>> >>> >>>>>> sync began. Is this the intention ? If so, how is this
>> >>> >>> >>>>>> achieved
>> >>> >>> >>>>>> without running agreement on the sync op ?
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> Thanks,
>> >>> >>> >>>>>> Alex
>> >>> >>> >>>>>>
>> >>> >>> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira
>> >>> >>> >>>>>> <[email protected]> wrote:
>> >>> >>> >>>>>>> sync simply flushes the channel between the leader and the
>> >>> >>> >>>>>>> follower that forwarded the sync operation, so it doesn't
>> >>> >>> >>>>>>> go
>> >>> >>> >>>>>>> through the
>> >>> >>> >>>>>>> full zab pipeline. Flushing means that all pending updates
>> >>> >>> >>>>>>> from the leader
>> >>> >>> >>>>>>> to the follower are received by the time sync completes.
>> >>> >>> >>>>>>> Starting a read
>> >>> >>> >>>>>>> operation concurrently with a sync implies that the result
>> >>> >>> >>>>>>> of
>> >>> >>> >>>>>>> the read will
>> >>> >>> >>>>>>> not miss an update committed before the read started.
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>> -Flavio
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>>>> Its strange that sync doesn't run through agreement, I
>> >>> >>> >>>>>>>> was
>> >>> >>> >>>>>>>> always
>> >>> >>> >>>>>>>> assuming that it is... Exactly for the reason you say -
>> >>> >>> >>>>>>>> you may trust your leader, but I may have a different
>> >>> >>> >>>>>>>> leader
>> >>> >>> >>>>>>>> and
>> >>> >>> >>>>>>>> your
>> >>> >>> >>>>>>>> leader may not detect it yet and still think its the
>> >>> >>> >>>>>>>> leader.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> This seems like a bug to me.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't
>> >>> >>> >>>>>>>> (or
>> >>> >>> >>>>>>>> shouldn't)
>> >>> >>> >>>>>>>> depend on timing assumption.
>> >>> >>> >>>>>>>> Only progress guarantees depend on time.
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> Alex
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>>
>> >>> >>> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino
>> >>> >>> >>>>>>>> <[email protected]> wrote:
>> >>> >>> >>>>>>>>> I have some pretty strong requirements in terms of
>> >>> >>> >>>>>>>>> consistency
>> >>> >>> >>>>>>>>> where
>> >>> >>> >>>>>>>>> reading from followers that may be behind in terms of
>> >>> >>> >>>>>>>>> updates
>> >>> >>> >>>>>>>>> isn't ok for
>> >>> >>> >>>>>>>>> my use case.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> One error case that worries me is if a follower and
>> >>> >>> >>>>>>>>> leader
>> >>> >>> >>>>>>>>> are
>> >>> >>> >>>>>>>>> partitioned
>> >>> >>> >>>>>>>>> off from the network.  A new leader is elected, but the
>> >>> >>> >>>>>>>>> follower and old
>> >>> >>> >>>>>>>>> leader don't know about it.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> Normally I think sync was made for this purpost, but I
>> >>> >>> >>>>>>>>> looked
>> >>> >>> >>>>>>>>> at the sync
>> >>> >>> >>>>>>>>> code and if there aren't any outstanding proposals the
>> >>> >>> >>>>>>>>> leader
>> >>> >>> >>>>>>>>> sends the
>> >>> >>> >>>>>>>>> sync right back to the client without first verifying
>> >>> >>> >>>>>>>>> that
>> >>> >>> >>>>>>>>> it
>> >>> >>> >>>>>>>>> still has
>> >>> >>> >>>>>>>>> quorum, so this won't work for my use case.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> At the core of the issue all I really need is a call
>> >>> >>> >>>>>>>>> that
>> >>> >>> >>>>>>>>> will
>> >>> >>> >>>>>>>>> make it's
>> >>> >>> >>>>>>>>> way to the leader and will ping it's followers, ensure
>> >>> >>> >>>>>>>>> it
>> >>> >>> >>>>>>>>> still
>> >>> >>> >>>>>>>>> has a
>> >>> >>> >>>>>>>>> quorum and return success.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
>> >>> >>> >>>>>>>>> forwarded to the
>> >>> >>> >>>>>>>>> leader, leader will ensure it still has quorum and
>> >>> >>> >>>>>>>>> return
>> >>> >>> >>>>>>>>> it's
>> >>> >>> >>>>>>>>> epoch.  I
>> >>> >>> >>>>>>>>> can use this primitive to implement all the other
>> >>> >>> >>>>>>>>> properties I
>> >>> >>> >>>>>>>>> want to
>> >>> >>> >>>>>>>>> verify (assuming that my client will never connect to an
>> >>> >>> >>>>>>>>> older
>> >>> >>> >>>>>>>>> epoch after
>> >>> >>> >>>>>>>>> this call returns). Also the nice thing about this
>> >>> >>> >>>>>>>>> method
>> >>> >>> >>>>>>>>> is
>> >>> >>> >>>>>>>>> that it will
>> >>> >>> >>>>>>>>> not have to hit disk and the latency should just be a
>> >>> >>> >>>>>>>>> round
>> >>> >>> >>>>>>>>> trip to the
>> >>> >>> >>>>>>>>> followers.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> Most of the guarentees offered by zookeeper are time
>> >>> >>> >>>>>>>>> based
>> >>> >>> >>>>>>>>> an
>> >>> >>> >>>>>>>>> rely on
>> >>> >>> >>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
>> >>> >>> >>>>>>>>> guarantees in
>> >>> >>> >>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends
>> >>> >>> >>>>>>>>> and
>> >>> >>> >>>>>>>>> any
>> >>> >>> >>>>>>>>> other way
>> >>> >>> >>>>>>>>> time is broken.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> Also if people are interested I can go into more detail
>> >>> >>> >>>>>>>>> about
>> >>> >>> >>>>>>>>> what I am
>> >>> >>> >>>>>>>>> trying to write.
>> >>> >>> >>>>>>>>>
>> >>> >>> >>>>>>>>> -jc
>> >>> >>> >>>>>>>
>> >>> >>> >>>>>
>> >>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>
>> >>
>> >
>
>

Reply via email to