* Report the number of r_met failed conditions to a statistical aggregator for alerting or trending on client-visible behavior.
* Pause some operation for a time if possible, and retry later.
* Possibly re-resolve and use another cluster that is healthier or less loaded.
* Indicate some hidden failure or bug in how shards were moved around or restored from down nodes.
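The options above could be sketched client-side roughly as follows. This is an illustrative wrapper only, under stated assumptions: `fetch_doc` and the `on_r_not_met` hook are hypothetical names, not part of any CouchDB client library, and the doc is assumed to carry the opt-in `_r_met` field discussed below.

```python
import time

def get_with_r_check(fetch_doc, doc_id, retries=3, backoff=0.5, on_r_not_met=None):
    """Hypothetical client-side handling of an r_met:false indication.

    `fetch_doc` is assumed to return a dict that may contain the opt-in
    '_r_met' field; `on_r_not_met` is an optional hook for a statistical
    aggregator. Retries with exponential backoff, then returns the
    (possibly under-replicated) doc anyway: availability over strictness.
    """
    delay = backoff
    for _attempt in range(retries):
        doc = fetch_doc(doc_id)
        if doc.get("_r_met", True):      # absent field == client didn't opt in
            return doc
        if on_r_not_met:
            on_r_not_met(doc_id)         # e.g. bump a metric for trending
        time.sleep(delay)                # pause, retry later
        delay *= 2
    return doc                           # last resort: accept what we got

# Toy fetcher: r unmet twice, then met.
responses = iter([{"_r_met": False}, {"_r_met": False}, {"_r_met": True}])
doc = get_with_r_check(lambda _id: next(responses), "doc1", backoff=0.01)
```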
</JamesM>

On Apr 3, 2015, at 17:27, Robert Samuel Newson <[email protected]> wrote:
>
> I’ve pushed an update to the fabric branch which accounts for when the r= value is higher than the number of replicas (so that it returns r_met:false).
>
> Changing this so that r_met is true only if R matching revisions are seen doesn’t sound too difficult.
>
> Where I struggle is seeing what a client can usefully do with this information. When you receive the r_met:false indication, however we end up conveying it, what will you do? Retry until r_met:true?
>
> B.
>
>> On 4 Apr 2015, at 00:55, Mutton, James <[email protected]> wrote:
>>
>> Based on Paul’s description it sounds like we may need to decide 3 things to close this out:
>> * What does satisfying R mean?
>> * What is the appropriate scope of when R is applied?
>> * How do we most appropriately convey the lack of R?
>>
>> I’m basing my opinions of R on W. W is satisfied when a write succeeds on W nodes. For behavior to be consistent between R and W, R should be considered met when R “matching” results have been found, if we treat “matching” == “successful”. I believe this to be a more correct interpretation of R-W consistency than treating R-satisfied as “found-but-not-matching”, since it matches the complete positive of W’s “successfully-written”. For scope, W acts for both current versions and historical revision updates (e.g. resolving conflicts). W also functions in bulk operations, so R should function in multi-key requests as well if it’s to be consistent.
>>
>> The last question is how to appropriately convey the lack of R. I tested these branches to see that the _r_met field was present, and that worked. I also made some quick modifications to return a 203 to see how some clients behaved.
>> Here are my test results: https://gist.github.com/jamutton/c823fdac328777e22646
>>
>> I tested a few clients, including an old version of couchdbkit, and all worked while the server was returning a 203 and/or the meta-field. A quick test with replication was mixed. I did a replicate into a CouchDB 1.6 machine and although I did see some errors, replication succeeded (the errors were related to checkpointing the target, and my 1.6 could have been messed up). All that to say that where I tested it, returning a 203 on R was accepted behavior by clients, just as returning a 202 on W is. By no means is that extensive, but it is at least indicative. So, I think both approaches, field and status-code, are possible for single-key requests (more on that in a second), and whether it’s status or field, I favor at least having consistency with W. We could also have consistency by converting W’s 202 into an in-document meta field like _w_met, present only when ?is_w_met=true is on the query string. That feels more drastic.
>> This approach of calling all views out of scope, because they could only even be in scope under certain circumstances, still leaves the door open for either a status-code or a field (and again, if using a field it would be more consistent API behavior to switch W to behave the same).
>>
>> Cheers,
>> </JamesM>
>>
>> On Apr 2, 2015, at 3:51, Robert Samuel Newson <[email protected]> wrote:
>>
>>> To move this along I have COUCHDB-2655 and three branches with a working solution;
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691
>>>
>>> All three branches are called 2655-r-met if you want to try this locally (and please do!)
>>>
>>> Sample output;
>>>
>>> curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'
>>>
>>> {"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}
>>>
>>> By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.
>>>
>>> B.
>>>
>>>> On 2 Apr 2015, at 10:36, Robert Samuel Newson <[email protected]> wrote:
>>>>
>>>> Yeah, not a bad idea. An extra query arg (akin to open_revs=all, conflicts=true, etc) would avoid compatibility breaks and would clearly put the onus on those supplying it to tolerate the presence of the extra reserved field.
>>>>
>>>> +1
>>>>
>>>>> On 2 Apr 2015, at 10:32, Benjamin Bastian <[email protected]> wrote:
>>>>>
>>>>> What about adding an optional query parameter to indicate whether or not Couch should include the _r_met flag in the document body/bodies (defaulting to false)? That wouldn't break older clients, and it'd work for the bulk API as well.
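As a sketch of how a client might consume the opt-in field from the sample output above (the parsing helper is illustrative only, not part of any client library): since `_r_met` is injected into the body only when the client opts in, a client probably wants to strip it back out before reusing the document.

```python
import json

# The sample response body from the 2655-r-met branches.
SAMPLE = '{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}'

def split_r_met(body_json):
    """Split the opt-in _r_met indicator out of a document body.

    Returns (doc, r_met) where r_met is True/False when the server
    injected the field, and None when the client did not opt in.
    Stripping the reserved field keeps the doc safe to write back.
    """
    doc = json.loads(body_json)
    r_met = doc.pop("_r_met", None)
    return doc, r_met

doc, r_met = split_r_met(SAMPLE)   # r_met is True; doc has no _r_met key
```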
>>>>> As far as the case where there are conflicts, it seems like the most intuitive thing would be for the "r" in "_r_met" to have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait for r copies of the same doc rev until a timeout" and "_r_met" would mean "we got/didn't get r copies of the same doc rev within the timeout").
>>>>>
>>>>> Just my two cents.
>>>>>
>>>>> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <[email protected]> wrote:
>>>>>
>>>>>> Paul outlined his previous efforts to introduce this indication, and the problems he faced doing so. Can we come up with an acceptable mechanism?
>>>>>>
>>>>>> A different status code will break a lot of users. While the http spec says you can treat any 2xx code as success, plenty of libraries, etc, only recognise 201 / 202 as a successful write and 200 (and maybe 204, 206) for reads.
>>>>>>
>>>>>> My preference is for a change that "can’t" break anyone, which I think only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most pleasant thing.
>>>>>>
>>>>>> Suggestions?
>>>>>>
>>>>>> B.
>>>>>>
>>>>>>> On 1 Apr 2015, at 06:55, Mutton, James <[email protected]> wrote:
>>>>>>>
>>>>>>> For at least my part of it, I agree with Adam. Bigcouch has made an effort to inform in the case of a failure to apply W. I've seen it lead to confusion when the same logic was not applied on R.
>>>>>>>
>>>>>>> I also agree that W and R are not binding contracts. There's no agreement protocol to assure that W is met before being committed to disk. But they are exposed as a blocking parameter of the request, so consistent notification appeared to me to be the best compromise (vs straight-up removal).
>>>>>>> </JamesM>
>>>>>>>
>>>>>>>> On Mar 31, 2015, at 13:15, Robert Newson <[email protected]> wrote:
>>>>>>>>
>>>>>>>> If a way can be found that doesn't break things and that can be sent in all or most cases, sure. It's what a user can really infer from it that I focused on. Not as much, I think, as users who want that info really want.
>>>>>>>>
>>>>>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I hope we can all agree that CouchDB should inform the user when it is unable to satisfy the requested read "quorum".
>>>>>>>>>
>>>>>>>>> Adam
>>>>>>>>>
>>>>>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Sounds like there's a bit of confusion here.
>>>>>>>>>>
>>>>>>>>>> What Nathan is asking for is the ability to have Couch respond with some information on the actual number of replicas that responded to a read request. That way a user could tell that they issued an r=2 request when only r=1 was actually performed. Depending on your point of view in an MVCC world this is either a bug or a feature. :)
>>>>>>>>>>
>>>>>>>>>> It was generally agreed that if we could return this information it would be beneficial. Although what happened when I started implementing this patch was that we are either only able to return it in a subset of cases where it happens, return it inconsistently between various responses, or break replication.
>>>>>>>>>>
>>>>>>>>>> There were three general methods for this. The first would be to include a new "_r_met" key in the doc body, a boolean indicating whether the requested read quorum was actually met for the document.
>>>>>>>>>> The second was to return a custom X-R-Met type header, and lastly the status code as described.
>>>>>>>>>>
>>>>>>>>>> The _r_met member was thought to be the best, but unfortunately that breaks replication with older clients, because we throw an error rather than ignore any unknown underscore-prefixed field name. Thus having something that was just dynamically injected into the document body was a non-starter. Unfortunately, if we don't inject into the document body then we limit ourselves to only the set of APIs where a single document is returned. This is due to both streaming semantics (we can't buffer an entire response in memory for large requests to _all_docs) as well as multi-doc responses (a single boolean doesn't say which document may not have had a properly met R).
>>>>>>>>>>
>>>>>>>>>> On top of that, the other confusing part of meeting the read quorum is that, given MVCC semantics, it becomes a bit unclear how you respond to documents with different revision histories. For instance, if we read two docs, we have technically met the r=2 requirement, but what should our response be if those two revisions are different? (Technically, in this case we wait for the third response, but the decision on what to return for the "r met" value is still unclear.)
>>>>>>>>>>
>>>>>>>>>> While I think everyone is in agreement that it'd be nice to return some of the information about the copies read, I think it's much less clear what and how it should be returned in the multitude of cases where we can specify a value for R.
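The ambiguity Paul describes can be made concrete with a small sketch (a hypothetical helper, not CouchDB/fabric code): under the rev-matching semantics Benjamin proposed upthread, r is met only when some single revision has been returned by at least r replicas, so two responses with *different* revisions do not satisfy r=2.

```python
from collections import Counter

def r_met(revs_seen, r):
    """Decide an '_r_met' value under rev-matching semantics: r is met
    only when at least r replicas returned the *same* doc rev.
    `revs_seen` is the list of revision ids collected before the timeout.
    Illustrative only; not how fabric actually tallies responses."""
    counts = Counter(revs_seen)
    return any(n >= r for n in counts.values())

# Two replicas answered, so r=2 responses arrived, but with different
# revisions: no single revision reached the quorum.
r_met(["1-abc", "2-def"], 2)   # False under rev-matching semantics
r_met(["1-abc", "1-abc"], 2)   # True
```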
>>>>>>>>>> While that doesn't offer a concrete path forward, hopefully it clarifies some of the issues at hand.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> It’s testament to my friendship with Mike that we can disagree on such things and remain friends. I am sorry he misled you, though.
>>>>>>>>>>>
>>>>>>>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at least in the formal sense, the only one that matters; this is unfortunately sloppy language in too many places to correct.
>>>>>>>>>>>
>>>>>>>>>>> The r= and w= parameters control only how many of the n possible responses are collected before returning an http response.
>>>>>>>>>>>
>>>>>>>>>>> It’s not true that returning 202 in the situation where one write is made but fewer than 'r' writes are made means we’ve chosen availability over consistency, since even if we returned a 500 or closed the connection without responding, a subsequent GET could return the document (a probability that increases over time as anti-entropy creates the missing copies). A write attempt that returned a 409 could, likewise, introduce a new edit branch into the document, which might then 'win', altering the results of a subsequent GET.
>>>>>>>>>>>
>>>>>>>>>>> The essential thing to remember is this: the ’n’ copies of your data are completely independent when written/read by the clustered layer (fabric). It is internal replication (anti-entropy) that converges those copies, pair-wise, to the same eventual state. Fabric is converting the 3 independent results into a single result as best it can.
>>>>>>>>>>> Older versions did not expose the 201 vs 202 distinction, calling both of them 201. I do agree with you that there’s little value in the 202 distinction. About the only thing you could do is investigate your cluster for connectivity issues or overloading if you get a sustained period of 202’s, as it would be an indicator that the system is partitioned.
>>>>>>>>>>>
>>>>>>>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure that the result of a write did not change after the fact. That is, anti-entropy would need to be disabled, or somehow agree to roll forward or backward based on the initial circumstances. In short, we’d have to introduce strong consistency (paxos or raft or zab, say). While this would be a great feature to add, it’s not currently present, and no amount of twiddling the status codes will achieve it. We’d rather be honest about our position on the CAP triangle.
>>>>>>>>>>>
>>>>>>>>>>> B.
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I first hit it a few years ago. I found the original thread — this is the discussion I was trying to recall in my OP. It sounds like perhaps there is a related issue tracked internally at Cloudant as a result of that conversation.
>>>>>>>>>>>>
>>>>>>>>>>>> JamesM, thanks for your support here and for tracking this down. 203 seemed like the best status code to "steal" for this to me too. Best wishes in getting this fixed!
>>>>>>>>>>>> regards,
>>>>>>>>>>>> -natevw
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.0 is explicitly an AP system; the behaviour you describe is not classified as a bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anti-entropy is the main reason that you cannot get strong consistency from the system: it will transform "failed" writes (those that succeeded on one node but fewer than R nodes) into successes (N copies) as long as the nodes have enough healthy uptime.
>>>>>>>>>>>>>
>>>>>>>>>>>>> True of Cloudant and 2.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Funny you should mention it. I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about). What happens currently in 2.0 appears unchanged from earlier versions. When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …} but leaves the acc state as the original r_not_met, which triggers a read_repair from the response handler. read_repair results in an {ok, …} with the only doc available, because no other docs are in the list. The final doc returned to chttpd_db:couch_doc_open, and thus to chttpd_db:db_doc_req, is simply {ok, Doc}, which has now lost the fact that the answer was not complete.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.
>>>>>>>>>>>>>> handle_response knows whether it has R met and could pass that forward, or allow read_repair to pass it forward if read_repair is able to satisfy acc.r. I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like, for the same reasons you cite. Plus it just seems disconnected to do it on writes but not reads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> </JamesM>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry, I have not been following the CouchDB 2.0 roadmap, but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available… if even just a single available node has some version of the requested document you will get a successful response (!).
>>>>>>>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -natevw
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
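The workaround natevw describes (treating a 202 write as a failure) amounts to a small client-side policy. A minimal sketch, assuming only the status codes discussed in this thread (201 = write with w met, 202 = write accepted but w unmet); the function name and `strict` flag are illustrative, not any library's API:

```python
def write_succeeded(status_code, strict=True):
    """Client-side policy for CouchDB/BigCouch write responses.

    In strict mode a 202 (write stored on at least one node, but the
    requested w was not met) is treated like a failure, as natevw did,
    so the caller can retry rather than proceed on a degraded write.
    """
    if status_code == 201:
        return True                 # write met the requested w
    if status_code == 202:
        return not strict           # degraded write: OK only if lenient
    return False                    # 4xx/5xx and anything else

write_succeeded(201)                # True
write_succeeded(202)                # False: quorum unmet, caller retries
write_succeeded(202, strict=False)  # True: availability preferred
```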
