To move this along I have COUCHDB-2655 and three branches with a working solution:

https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691

All three branches are called 2655-r-met if you want to try this locally (and please do!)

Sample output:

curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'

{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}

By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.

B.

> On 2 Apr 2015, at 10:36, Robert Samuel Newson <[email protected]> wrote:
>
> Yeah, not a bad idea. An extra query arg (akin to open_revs=all,
> conflicts=true, etc) would avoid compatibility breaks and would clearly put
> the onus on those supplying it to tolerate the presence of the extra reserved
> field.
>
> +1
>
>> On 2 Apr 2015, at 10:32, Benjamin Bastian <[email protected]> wrote:
>>
>> What about adding an optional query parameter to indicate whether or not
>> Couch should include the _r_met flag in the document body/bodies
>> (defaulting to false)? That wouldn't break older clients and it'd work for
>> the bulk API as well. As far as the case where there are conflicts, it
>> seems like the most intuitive thing would be for the "r" in "_r_met" to
>> have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
>> for r copies of the same doc rev until a timeout" and "_r_met" would mean
>> "we got/didn't get r copies of the same doc rev within the timeout").
>>
>> Just my two cents.
>>
>> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <[email protected]>
>> wrote:
>>
>>> Paul outlined his previous efforts to introduce this indication, and the
>>> problems he faced doing so. Can we come up with an acceptable mechanism?
>>>
>>> A different status code will break a lot of users.
>>> While the http spec says you can treat any 2xx code as success, plenty of
>>> libraries, etc, only recognise 201 / 202 as successful write and 200 (and
>>> maybe 204, 206) for reads.
>>>
>>> My preference is for a change that "can’t" break anyone, which I think
>>> only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most
>>> pleasant thing.
>>>
>>> Suggestions?
>>>
>>> B.
>>>
>>>> On 1 Apr 2015, at 06:55, Mutton, James <[email protected]> wrote:
>>>>
>>>> For at least my part of it, I agree with Adam. Bigcouch has made an
>>>> effort to inform in the case of a failure to apply W. I've seen it lead
>>>> to confusion when the same logic was not applied on R.
>>>>
>>>> I also agree that W and R are not binding contracts. There's no
>>>> agreement protocol to assure that W is met before being committed to
>>>> disk. But they are exposed as a blocking parameter of the request, so
>>>> notification being consistent appeared to me to be the best compromise
>>>> (vs straight up removal).
>>>>
>>>> </JamesM>
>>>>
>>>>> On Mar 31, 2015, at 13:15, Robert Newson <[email protected]> wrote:
>>>>>
>>>>> If a way can be found that doesn't break things that can be sent in all
>>>>> or most cases, sure. It's what a user can really infer from that which
>>>>> I focused on. Not as much, I think, as users that want that info really
>>>>> want.
>>>>>
>>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <[email protected]> wrote:
>>>>>>
>>>>>> I hope we can all agree that CouchDB should inform the user when it is
>>>>>> unable to satisfy the requested read "quorum".
>>>>>>
>>>>>> Adam
>>>>>>
>>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <[email protected]> wrote:
>>>>>>>
>>>>>>> Sounds like there's a bit of confusion here.
>>>>>>>
>>>>>>> What Nathan is asking for is the ability to have Couch respond with
>>>>>>> some information on the actual number of replicas that responded to a
>>>>>>> read request.
>>>>>>> That way a user could tell that they issued an r=2 request when only
>>>>>>> r=1 was actually performed. Depending on your point of view in an
>>>>>>> MVCC world this is either a bug or a feature. :)
>>>>>>>
>>>>>>> It was generally agreed upon that if we could return this information
>>>>>>> it would be beneficial. Although what happened when I started
>>>>>>> implementing this patch was that we are either only able to return it
>>>>>>> in a subset of cases where it happens, return it inconsistently
>>>>>>> between various responses, or break replication.
>>>>>>>
>>>>>>> The three general methods for this would be to either include a new
>>>>>>> "_r_met" key in the doc body that would be a boolean indicating if the
>>>>>>> requested read quorum was actually met for the document. The second
>>>>>>> was to return a custom X-R-Met type header, and lastly was the status
>>>>>>> code as described.
>>>>>>>
>>>>>>> The _r_met member was thought to be the best, but unfortunately that
>>>>>>> breaks replication with older clients because we throw an error rather
>>>>>>> than ignore any unknown underscore prefixed field name. Thus having
>>>>>>> something that was just dynamically injected into the document body
>>>>>>> was a non-starter. Unfortunately, if we don't inject into the document
>>>>>>> body then we limit ourselves to only the set of APIs where a single
>>>>>>> document is returned. This is due to both streaming semantics (we
>>>>>>> can't buffer an entire response in memory for large requests to
>>>>>>> _all_docs) as well as multi-doc responses (a single boolean doesn't
>>>>>>> say which document may have not had a properly met R).
>>>>>>>
>>>>>>> On top of that, the other confusing part of meeting the read quorum is
>>>>>>> that given MVCC semantics it becomes a bit confusing on how you
>>>>>>> respond to documents with different revision histories.
>>>>>>> For instance, if we read two docs, we have technically made the r=2
>>>>>>> requirement, but what should our response be if those two revisions
>>>>>>> are different (technically, in this case we wait for the third
>>>>>>> response, but the decision on what to return for the "r met" value is
>>>>>>> still unclear).
>>>>>>>
>>>>>>> While I think everyone is in agreement that it'd be nice to return
>>>>>>> some of the information about the copies read, I think it's much less
>>>>>>> clear what and how it should be returned in the multitude of cases
>>>>>>> that we can specify a value for R.
>>>>>>>
>>>>>>> While that doesn't offer a concrete path forward, hopefully it
>>>>>>> clarifies some of the issues at hand.
>>>>>>>
>>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> It’s testament to my friendship with Mike that we can disagree on
>>>>>>>> such things and remain friends. I am sorry he misled you, though.
>>>>>>>>
>>>>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at
>>>>>>>> all, at least in the formal sense, the only one that matters, this is
>>>>>>>> unfortunately sloppy language in too many places to correct.
>>>>>>>>
>>>>>>>> The r= and w= parameters control only how many of the n possible
>>>>>>>> responses are collected before returning an http response.
>>>>>>>>
>>>>>>>> It’s not true that returning 202 in the situation where one write is
>>>>>>>> made but fewer than 'r' writes are made means we’ve chosen
>>>>>>>> availability over consistency since even if we returned a 500 or
>>>>>>>> closed the connection without responding, a subsequent GET could
>>>>>>>> return the document (a probability that increases over time as
>>>>>>>> anti-entropy makes the missing copies).
>>>>>>>> A write attempt that returned a 409 could, likewise, introduce a new
>>>>>>>> edit branch into the document, which might then 'win', altering the
>>>>>>>> results of a subsequent GET.
>>>>>>>>
>>>>>>>> The essential thing to remember is this: the ’n’ copies of your data
>>>>>>>> are completely independent when written/read by the clustered layer
>>>>>>>> (fabric). It is internal replication (anti-entropy) that converges
>>>>>>>> those copies, pair-wise, to the same eventual state. Fabric is
>>>>>>>> converting the 3 independent results into a single result as best it
>>>>>>>> can. Older versions did not expose the 201 vs 202 distinction,
>>>>>>>> calling both of them 201. I do agree with you that there’s little
>>>>>>>> value in the 202 distinction. About the only thing you could do is
>>>>>>>> investigate your cluster for connectivity issues or overloading if
>>>>>>>> you get a sustained period of 202’s, as it would be an indicator that
>>>>>>>> the system is partitioned.
>>>>>>>>
>>>>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure that
>>>>>>>> the result of a write did not change after the fact. That is,
>>>>>>>> anti-entropy would need to be disabled, or somehow agree to roll
>>>>>>>> forward or backward based on the initial circumstances. In short,
>>>>>>>> we’d have to introduce strong consistency (paxos or raft or zab,
>>>>>>>> say). While this would be a great feature to add, it’s not currently
>>>>>>>> present, and no amount of twiddling the status codes will achieve it.
>>>>>>>> We’d rather be honest about our position on the CAP triangle.
>>>>>>>>
>>>>>>>> B.
>>>>>>>>
>>>>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I
>>>>>>>>> first hit it a few years ago.
>>>>>>>>> I found back the original thread here - this is the discussion I was
>>>>>>>>> trying to recall in my OP:
>>>>>>>>> It sounds like perhaps there is a related issue tracked internally
>>>>>>>>> at Cloudant as a result of that conversation.
>>>>>>>>>
>>>>>>>>> JamesM, thanks for your support here and tracking this down. 203
>>>>>>>>> seemed like the best status code to "steal" for this to me too. Best
>>>>>>>>> wishes in getting this fixed!
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> -natevw
>>>>>>>>>
>>>>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> 2.0 is explicitly an AP system, the behaviour you describe is not
>>>>>>>>>> classified as a bug.
>>>>>>>>>>
>>>>>>>>>> Anti-entropy is the main reason that you cannot get strong
>>>>>>>>>> consistency from the system, it will transform "failed" writes
>>>>>>>>>> (those that succeeded on one node but fewer than R nodes) into
>>>>>>>>>> success (N copies) as long as the nodes have enough healthy uptime.
>>>>>>>>>>
>>>>>>>>>> True of cloudant and 2.0.
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Funny you should mention it. I drafted an email in early February
>>>>>>>>>>> to queue up the same discussion whenever I could get involved
>>>>>>>>>>> again (which I promptly forgot about). What happens currently in
>>>>>>>>>>> 2.0 appears unchanged from earlier versions. When R is not
>>>>>>>>>>> satisfied in fabric, fabric_doc_open:handle_message eventually
>>>>>>>>>>> responds with a {stop, …} but leaves the acc-state as the original
>>>>>>>>>>> r_not_met which triggers a read_repair from the response handler.
>>>>>>>>>>> read_repair results in an {ok, …} with the only doc available,
>>>>>>>>>>> because no other docs are in the list.
>>>>>>>>>>> The final doc returned to chttpd_db:couch_doc_open and thusly to
>>>>>>>>>>> chttpd_db:db_doc_req is simply {ok, Doc}, which has now lost the
>>>>>>>>>>> fact that the answer was not complete.
>>>>>>>>>>>
>>>>>>>>>>> This seems straightforward to fix by a change in
>>>>>>>>>>> fabric_open_doc:handle_response and read_repair. handle_response
>>>>>>>>>>> knows whether it has R met and could pass that forward, or allow
>>>>>>>>>>> read-repair to pass it forward if read_repair is able to satisfy
>>>>>>>>>>> acc.r. I can’t speak for community interest in the behavior of
>>>>>>>>>>> sending a 202, but it’s something I’d definitely like for the same
>>>>>>>>>>> reasons you cite. Plus it just seems disconnected to do it on
>>>>>>>>>>> writes but not reads.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> </JamesM>
>>>>>>>>>>>
>>>>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was
>>>>>>>>>>>> extending my fermata-couchdb plugin today and realized that
>>>>>>>>>>>> perhaps the Apache release of BigCouch as CouchDB 2.0 might
>>>>>>>>>>>> provide an opportunity to fix a serious issue I had using
>>>>>>>>>>>> Cloudant's implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> See
>>>>>>>>>>>> https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518
>>>>>>>>>>>> for some additional background/explanation, but my understanding
>>>>>>>>>>>> is that Cloudant for all practical purposes ignores the read
>>>>>>>>>>>> durability parameter. So you can write with ?w=N to attempt some
>>>>>>>>>>>> level of quorum, and get a 202 back if that quorum is unmet.
>>>>>>>>>>>> _However_ when you ?r=N it really doesn't matter if only <N nodes
>>>>>>>>>>>> are available…if even just a single available node has some
>>>>>>>>>>>> version of the requested document you will get a successful
>>>>>>>>>>>> response (!).
>>>>>>>>>>>>
>>>>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo
>>>>>>>>>>>> features to dynamically _choose_ between consistency or
>>>>>>>>>>>> availability - when it comes time to read back a consistent
>>>>>>>>>>>> result, BigCouch instead just always gives you availability*
>>>>>>>>>>>> regardless of what a given request actually needs. (In my usage I
>>>>>>>>>>>> ended up treating a 202 write as a 500, rather than proceeding
>>>>>>>>>>>> with no way of ever knowing whether a write did NOT ACTUALLY
>>>>>>>>>>>> conflict or just hadn't YET because $who_knows_how_many nodes
>>>>>>>>>>>> were still down…)
>>>>>>>>>>>>
>>>>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug
>>>>>>>>>>>> by a Cloudant engineer (or support personnel at least) but could
>>>>>>>>>>>> not be quickly fixed as it could introduce
>>>>>>>>>>>> backwards-compatibility concerns. So…
>>>>>>>>>>>>
>>>>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with
>>>>>>>>>>>> BigCouch? If true, could this read durability issue now be fixed
>>>>>>>>>>>> during the merge?
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> -natevw
>>>>>>>>>>>>
>>>>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual
>>>>>>>>>>>> uptime of *any* Couch fork…
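
For anyone trying the 2655-r-met branches from a client, here is a rough Python sketch of how the opt-in flag might be consumed. It assumes the server behaves as in the sample output at the top of this thread (an is_r_met=true query arg, and a boolean _r_met member in the returned doc). The helper names doc_url, split_r_met and same_rev_quorum are made up for illustration and are not part of CouchDB or any client library. The last helper only restates the "r copies of the same doc rev" reading suggested in the thread; it is not necessarily what fabric does.

```python
import json
from collections import Counter
from urllib.parse import quote, urlencode


def doc_url(base, db, doc_id, r=None, want_r_met=True):
    """Build a GET URL that opts in to the _r_met flag (hypothetical helper)."""
    params = {}
    if r is not None:
        params["r"] = str(r)
    if want_r_met:
        # Query arg taken from the sample output in this thread.
        params["is_r_met"] = "true"
    path = f"{base}/{quote(db, safe='')}/{quote(doc_id, safe='')}"
    return path + (f"?{urlencode(params)}" if params else "")


def split_r_met(body):
    """Parse a response body and return (doc, r_met).

    The flag is removed from the doc so it is not written back by accident;
    older servers reject unknown underscore-prefixed fields. r_met is None
    when the server did not report it, which is different from False.
    """
    doc = json.loads(body)
    r_met = doc.pop("_r_met", None)
    return doc, r_met


def same_rev_quorum(revs, r):
    """True if at least r of the replica responses carry the same rev.

    Illustrates the proposed meaning of _r_met only; an assumption, not
    a description of the actual fabric implementation.
    """
    if not revs:
        return False
    return Counter(revs).most_common(1)[0][1] >= r
```

A caller would fetch doc_url(...) with whatever HTTP client it already uses, pass the body to split_r_met, and decide for itself what to do when the flag comes back false (retry, warn, or carry on).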
