I checked your message again, and the contract is that getFreshTimestamp() always returns a fresh timestamp (one not older than any previously returned by that method). One aspect of the contract is not clear to me: does this freshness property hold across clients, or only for a single client?

If it holds only per client, then you can guarantee it with master epochs: no user of your application will connect to the master of epoch y < x once it has connected to the master of epoch x. The epochs can be the sequence numbers of a sequential znode. How does that sound?
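[Editor's note: the per-client epoch check Flavio describes can be sketched locally. This is a hypothetical client-side guard, not ZooKeeper API; the epoch that would come from a sequential znode's sequence number is simulated by a plain long supplied by the caller.]

```java
// Sketch of client-side epoch fencing: a client remembers the highest
// master epoch it has seen and refuses to connect to any older master.
// In a real deployment the epoch would be the sequence number of a
// sequential znode; here it is just a long (hypothetical API).
class EpochGuard {
    private long highestEpochSeen = -1;

    /** Returns true if connecting to a master of this epoch is allowed. */
    public synchronized boolean tryConnect(long masterEpoch) {
        if (masterEpoch < highestEpochSeen) {
            return false; // stale master: we already saw a newer epoch
        }
        highestEpochSeen = masterEpoch;
        return true;
    }
}
```

With this guard, once a client has talked to the master of epoch x, any master of epoch y < x is rejected, which is exactly the per-client freshness property discussed above.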
-Flavio

On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:

> I was thinking that you can do a write per timestamp batch, not per individual timestamp. In the worst case, a former leader won't use some timestamps; I would think that is ok, but it depends on your application.
>
> Also, even if two clients believe they are leaders simultaneously and serve timestamps, the property that you seem to care about the most is uniqueness: a timestamp is not served twice. Order of timestamps would be preserved per leader, but in the case of overlapping leaders you could end up serving timestamps that do not follow an increasing order.
>
> -Flavio
>
> On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
>
>> CompareAndSwap is an atomic check and update: it only updates the value if the value equals the expected one.
>>
>> I think with your approach you'd have to do a write for every single timestamp you wanted to hand out. The latency hit from this would be too much.
>>
>> My approach is different in that a timestamp server reserves a bunch of timestamps up front and proceeds to hand them out as long as it is the leader. The leader check can hopefully be done without hitting disk.
>>
>> Thanks!
>>
>> -jc
>>
>> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <[email protected]> wrote:
>>
>>> I don't know what your compareAndSwap method does, but I was wondering whether your client process can use conditional writes to a znode to make sure that it was the last one to update the state of timestamp batches. You can treat the master election problem separately, and it does not have to be as strict as you have been thinking: it wouldn't hurt if a client still thinks it is leading even when it is not, because no two clients will be able to update the state of timestamp blocks without noticing that another client is also updating it.
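[Editor's note: the conditional-write idea can be illustrated without a live ensemble. The sketch below mimics a versioned znode: an update succeeds only if the caller's expected version matches, so an overlapping leader's write is detected rather than silently lost. The names are invented; ZooKeeper's real equivalent is setData with an expected version.]

```java
// Simulated versioned znode: setData succeeds only when the expected
// version matches, mirroring ZooKeeper's conditional write. A client
// holding a stale version learns that another client has updated the
// timestamp-batch state in the meantime.
class VersionedCell {
    private long data;
    private int version = 0;

    public synchronized int getVersion() { return version; }
    public synchronized long getData() { return data; }

    /** Returns the new version on success, or -1 if expectedVersion is stale. */
    public synchronized int setData(long newData, int expectedVersion) {
        if (expectedVersion != version) {
            return -1; // someone else updated the cell since we read it
        }
        data = newData;
        version++;
        return version;
    }
}
```

This is why the master election does not have to be strict: a client that wrongly believes it is leading simply gets -1 back and notices the other updater.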
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
>>>
>>>> So I think it's time to explain what I'm writing, just so everyone has more situational awareness. It's just a timestamp server, nothing fancy. It looks like this:
>>>>
>>>> public interface TimestampService {
>>>>     /**
>>>>      * This will get a fresh timestamp that is guaranteed to be newer
>>>>      * than any other timestamp handed out before this method was called.
>>>>      */
>>>>     long getFreshTimestamp();
>>>> }
>>>>
>>>> The only requirement is that the timestamp handed back is greater than every other timestamp that was returned before getFreshTimestamp was called. There is no ordering requirement for concurrent requests.
>>>>
>>>> My impl is to reserve blocks of timestamps that are safe to hand out (1M at a time) using compare-and-swap in ZK:
>>>>
>>>> lastPossibleUsed = read(HighWater)
>>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
>>>>
>>>> Now my leader can hand back timestamps up to safeToHandout, but before it hands one out it must ensure it is still the leader (no one else has handed back something higher). I can use ensureQuorum() and exists(myEphemNode) to make sure this is the case. Now I have a service that is guaranteed to be correct but doesn't require disk hits in the steady state, which brings down my latency (if you get close to running out, you can compareAndSwap for more timestamps).
>>>>
>>>> If many requests come in at the same time, I can use smart batching to verify happens-after for all of them at once. We can also add more layers if we need more bandwidth to scale up, at the cost of adding latency. Basically our latency will be O(lg(requestRate)) if we keep adding layers as each previous layer becomes saturated.
>>>>
>>>> I hope this explanation helps. I am busy for the next 4 hours, but if you need more clarification I can respond then.
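[Editor's note: John's reservation scheme can be sketched with an AtomicLong standing in for the HighWater znode. The real implementation would do the compare-and-swap against ZooKeeper and verify leadership before each hand-out; both are simplified away here, and all names are illustrative.]

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of block-based timestamp allocation: reserve a block of
// timestamps with one compare-and-swap on the high-water mark, then
// hand them out from memory until the block is exhausted. The shared
// AtomicLong stands in for the HighWater znode; a real leader would
// also confirm it still leads before handing each timestamp out.
class BlockTimestampService {
    private final AtomicLong highWater;   // stand-in for the ZK znode
    private final long blockSize;
    private long next;           // next timestamp to hand out
    private long safeToHandout;  // exclusive upper bound of reserved block

    BlockTimestampService(AtomicLong highWater, long blockSize) {
        this.highWater = highWater;
        this.blockSize = blockSize;
        this.next = 0;
        this.safeToHandout = 0;
    }

    public synchronized long getFreshTimestamp() {
        if (next >= safeToHandout) {
            reserveBlock(); // only here do we touch the shared high-water mark
        }
        return next++;
    }

    private void reserveBlock() {
        while (true) {
            long lastPossibleUsed = highWater.get();
            if (highWater.compareAndSet(lastPossibleUsed, lastPossibleUsed + blockSize)) {
                next = lastPossibleUsed;
                safeToHandout = lastPossibleUsed + blockSize;
                return;
            }
        }
    }
}
```

Note how the expensive CAS happens once per block, not once per timestamp, which is the latency argument made in the thread.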
>>>>
>>>> -jc
>>>>
>>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]> wrote:
>>>>
>>>>> First, thanks everyone for talking this through with me.
>>>>>
>>>>> Flavio, for your example, this is actually ok. There is a happens-after relationship between the client making the request and my leader C1 still being the leader. My service only needs to guarantee that what it hands back is at least as new as anything that existed when the client made the request. If C2 were to answer requests while C1 is stalling, that is ok, because these would be considered concurrent requests; what C2 returns may be newer, but that doesn't violate any guarantees.
>>>>>
>>>>> If some client were to get back something from C2 and then (happens-after relationship) someone tried to read from C1, that read needs to fail.
>>>>>
>>>>> To address your concern about adding too much bandwidth, we can get this cheaply by doing what Martin Thompson calls smart batching (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html):
>>>>>
>>>>> 1. An ensureQuorum request comes in to L1.
>>>>> 2. Send ENSURE to all followers.
>>>>> 3. 10 more ensureQuorum requests come in.
>>>>> 4. Get back ENSURE from a quorum.
>>>>> 5. We can now service all 10 pending ensureQuorum requests with one more round-trip ENSURE.
>>>>>
>>>>> We don't need to send an ENSURE for every ensureQuorum request; we just need an ENSURE that happens after the request arrived.
>>>>>
>>>>> I am fine with the ephemeral node being removed after some time expires, but only by the leader. If the leader's clock is broken and the client owning the ephemeral node drops off, then we don't have liveness (because this node may not get cleaned up in a timely fashion). However, we still preserve correctness.
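[Editor's note: the numbered smart-batching steps above can be sketched as a single-threaded simulation. Every request pending when a round trip starts is answered by that round trip, so the number of round trips grows with latency, not with request rate. The quorum round trip is faked by a counter; all names are invented.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of smart batching for ensureQuorum: all requests pending when
// an ENSURE round trip starts are completed by that round trip, so 10
// requests arriving during one in-flight ENSURE cost one extra round
// trip, not 10.
class SmartBatcher {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private int roundTrips = 0;

    /** A client asks the leader to confirm it still has a quorum. */
    public void submit(Runnable onEnsured) {
        pending.add(onEnsured);
    }

    /** Drain everything currently pending with a single ENSURE round trip. */
    public void drainOnce() {
        if (pending.isEmpty()) return;
        int batch = pending.size();
        roundTrips++; // one simulated ENSURE round trip to the followers
        for (int i = 0; i < batch; i++) {
            pending.poll().run(); // every request that predates the trip is ensured
        }
    }

    public int roundTrips() { return roundTrips; }
}
```

The key property is the happens-after one from the thread: a completed round trip that started after a request arrived is evidence enough for that request, so requests need not each trigger their own ENSURE.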
>>>>>
>>>>> -jc
>>>>>
>>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>
>>>>>> Say that we implement what you're suggesting. Could you check whether this scenario can happen:
>>>>>>
>>>>>> 1- Client C1 is the current leader, and it issues a super-boosted read to make sure it is still the leader;
>>>>>> 2- We process the super-boosted read, having it go through the zab pipeline;
>>>>>> 3- When we send the response to C1, we slow the whole thing down: the response to C1 gets delayed and we stall C1;
>>>>>> 4- In the meanwhile, C1's session expires on the server side and its ephemeral leadership node is removed;
>>>>>> 5- A new client C2 is elected and starts exercising leadership;
>>>>>> 6- Now C1 comes back to normal and receives the response to the super-boosted read saying that it is still the leader.
>>>>>>
>>>>>> If my interpretation is correct, the only way to prevent this scenario is for the session to expire on the client side before it receives the response to the read. It doesn't look like we can do that if process clocks can be arbitrarily delayed.
>>>>>>
>>>>>> Note that one issue is that the behavior of ephemerals is highly dependent upon timers, so I don't think we can avoid making some timing assumptions altogether. The question is whether we are better off with a mechanism relying upon acknowledgements. My sense is that application-level fencing is preferable (if not necessary) for applications like the ones JC is mentioning, or BookKeeper.
>>>>>>
>>>>>> I'm not concerned about writes to disk, which I agree we don't need for sync. I'm more concerned about having it go through the whole pipeline, which will induce more traffic on zab and increase latency for an application that uses it heavily.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>>>>>
>>>>>>> Another idea is to add this functionality to MultiOp: have read-only transactions be replicated but not logged, or logged asynchronously. I'm not sure how it works right now if I do a read-only MultiOp transaction: does it replicate the transaction, or answer it locally on the leader?
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for the explanation.
>>>>>>>>
>>>>>>>> I guess one could always invoke a write operation instead of sync to get the stricter semantics, but as John suggests, it might be a good idea to add a new type of operation that requires followers to ack but doesn't require them to log to disk. That seems sufficient in our case.
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> In theory, the scenario you're describing could happen, but I would argue that it is unlikely, given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of this scenario low.
>>>>>>>>>
>>>>>>>>> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently, and making sync go through the pipeline would slow down reads. Although sync is optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases it doesn't cover. One is the case you mentioned. Another is having the sync finish before the client submits the read, with a write committing in between. We rely upon the way we implement timeouts, and some minimum degree of synchrony for clients when submitting operations, to guarantee that the scheme works.
>>>>>>>>>
>>>>>>>>> We thought about the option of having the sync operation go through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt the cheap version was sufficient for the use cases we had and more efficient, as I already argued.
>>>>>>>>>
>>>>>>>>> Hope this helps to clarify.
>>>>>>>>>
>>>>>>>>> -Flavio
>>>>>>>>>
>>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the explanation! But how do you avoid the scenario raised by John? Let's say you're a client connected to F, and F is connected to L. Let's also say that L's pipeline is now empty, and both F and L are partitioned from 3 other servers in the system that have already elected a new leader L'. Now I go to L' and write something. L still thinks it's the leader, because the detection that followers have left it is obviously timeout dependent. So when F sends your sync to L and L returns it to F, you actually miss my write!
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Alex, because of the following:
>>>>>>>>>>>
>>>>>>>>>>> 1- A follower F processes operations from a client in FIFO order; say that a client submits, as you describe, sync + read;
>>>>>>>>>>> 2- The sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>>>>>>>>>>> 3- The follower will process all pending updates before processing the response of the sync;
>>>>>>>>>>> 4- Once the follower processes the sync, it picks up the read operation. It reads the local state of the follower and returns to the client.
>>>>>>>>>>>
>>>>>>>>>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>>>>>>>>>
>>>>>>>>>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated), and the leader simply inserts it into the sequence of updates for F, ordering it.
>>>>>>>>>>>
>>>>>>>>>>> -Flavio
>>>>>>>>>>>
>>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Flavio,
>>>>>>>>>>>>
>>>>>>>>>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
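[Editor's note: the four steps Flavio lists can be sketched as a queue simulation. The leader appends the sync marker behind the updates already queued for follower F, so F applies everything ahead of the marker before serving the read. This illustrates the flush argument only; it is not ZooKeeper code, and all names are invented.]

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of sync-as-flush: committed updates sit in a FIFO queue from
// the leader to follower F. A sync is enqueued behind them, so by the
// time F sees the sync marker it has applied every update that was
// pending when the sync reached the leader.
class SyncFlushDemo {
    static final Object SYNC = new Object();

    static final class Update {
        final String key, value;
        Update(String key, String value) { this.key = key; this.value = value; }
    }

    private final Queue<Object> channel = new ArrayDeque<>(); // leader -> F
    private final Map<String, String> followerState = new HashMap<>();

    void leaderCommit(String key, String value) { channel.add(new Update(key, value)); }
    void leaderSync() { channel.add(SYNC); } // queued after pending updates

    /** Follower: apply everything up to and including the sync marker. */
    void followerProcessUntilSync() {
        Object msg;
        while ((msg = channel.poll()) != null) {
            if (msg == SYNC) return;            // flush complete
            Update u = (Update) msg;
            followerState.put(u.key, u.value);  // apply pending update first
        }
    }

    String read(String key) { return followerState.get(key); }
}
```

Because the channel is FIFO, the read served after the sync marker cannot miss any update that was already queued for F, which is Step 4's claim. The corner case in the thread is that updates committed at a *different* leader never enter this queue at all.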
>>>>>>>>>>>>
>>>>>>>>>>>> I thought that the intention of sync was to give something like linearizable reads, so if you invoke a sync and then a read, your read is guaranteed to (at least) see any write which completed before the sync began. Is this the intention? If so, how is this achieved without running agreement on the sync op?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time the sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's strange that sync doesn't run through agreement; I was always assuming that it does... exactly for the reason you say: you may trust your leader, but I may have a different leader, and your leader may not have detected that yet and still think it's the leader.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This seems like a bug to me.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't) depend on timing assumptions. Only progress guarantees depend on time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have some pretty strong requirements in terms of consistency, where reading from followers that may be behind in terms of updates isn't ok for my use case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One error case that worries me is a follower and leader being partitioned off from the network. A new leader is elected, but the follower and old leader don't know about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Normally I think sync was made for this purpose, but I looked at the sync code, and if there aren't any outstanding proposals, the leader sends the sync right back to the client without first verifying that it still has quorum, so this won't work for my use case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At the core of the issue, all I really need is a call that will make its way to the leader, ping its followers, ensure it still has a quorum, and return success.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically, a getCurrentLeaderEpoch() method that will be forwarded to the leader; the leader will ensure it still has quorum and return its epoch. I can use this primitive to implement all the other properties I want to verify (assuming that my client will never connect to an older epoch after this call returns). Also, the nice thing about this method is that it will not have to hit disk, and the latency should just be a round trip to the followers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on clocks and expiring timers, but I'm hoping to offer some guarantees in spite of busted clocks, horrible GC performance, VM suspends, and any other way time is broken.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, if people are interested, I can go into more detail about what I am trying to write.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -jc
