* Report the number of r_met failed conditions to a statistical aggregator for alerting or trending on client-visible behavior.
* Pause some operation for a time if possible, and retry later.
* Possibly re-resolve and use another cluster that is healthier or less loaded.
* Indicate some hidden failure or bug in how shards were moved around or restored from down nodes.
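The options above could be sketched client-side roughly as follows. This is an illustrative wrapper only, under stated assumptions: `fetch_doc` and the `on_r_not_met` hook are hypothetical names, not part of any CouchDB client library, and the doc is assumed to carry the opt-in `_r_met` field discussed below.

```python
import time

def get_with_r_check(fetch_doc, doc_id, retries=3, backoff=0.5, on_r_not_met=None):
    """Hypothetical client-side handling of an r_met:false indication.

    `fetch_doc` is assumed to return a dict that may contain the opt-in
    '_r_met' field; `on_r_not_met` is an optional hook for a statistical
    aggregator. Retries with exponential backoff, then returns the
    (possibly under-replicated) doc anyway: availability over strictness.
    """
    delay = backoff
    for _attempt in range(retries):
        doc = fetch_doc(doc_id)
        if doc.get("_r_met", True):      # absent field == client didn't opt in
            return doc
        if on_r_not_met:
            on_r_not_met(doc_id)         # e.g. bump a metric for trending
        time.sleep(delay)                # pause, retry later
        delay *= 2
    return doc                           # last resort: accept what we got

# Toy fetcher: r unmet twice, then met.
responses = iter([{"_r_met": False}, {"_r_met": False}, {"_r_met": True}])
doc = get_with_r_check(lambda _id: next(responses), "doc1", backoff=0.01)
```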
</JamesM>

On Apr 3, 2015, at 17:27, Robert Samuel Newson <[email protected]> wrote:
>
> I’ve pushed an update to the fabric branch which accounts for when the r= value is higher than the number of replicas (so that it returns r_met:false).
>
> Changing this so that r_met is true only if R matching revisions are seen doesn’t sound too difficult.
>
> Where I struggle is seeing what a client can usefully do with this information. When you receive the r_met:false indication, however we end up conveying it, what will you do? Retry until r_met:true?
>
> B.
>
>> On 4 Apr 2015, at 00:55, Mutton, James <[email protected]> wrote:
>>
>> Based on Paul’s description it sounds like we may need to decide 3 things to close this out:
>> * What does satisfying R mean?
>> * What is the appropriate scope of when R is applied?
>> * How do we most appropriately convey the lack of R?
>>
>> I’m basing my opinions of R on W. W is satisfied when a write succeeds on W nodes. For behavior to be consistent between R and W, R should be considered met when R “matching” results have been found, if we treat “matching” == “successful”. I believe this to be a more correct interpretation of R-W consistency than treating R-satisfied as “found-but-not-matching”, since it matches the complete positive of W’s “successfully-written”. For scope, W acts for both current versions and historical revision updates (e.g. resolving conflicts). W also functions in bulk operations, so R should function in multi-key requests as well if it’s to be consistent.
>>
>> The last question is how to appropriately convey the lack of R. I tested these branches to see that the _r_met field was present, and that worked. I also made some quick modifications to return a 203 to see how some clients behaved.
>> Here are my test results: https://gist.github.com/jamutton/c823fdac328777e22646
>>
>> I tested a few clients, including an old version of couchdbkit, and all worked while the server was returning a 203 and/or the meta-field. A quick test with replication was mixed. I did a replicate into a CouchDB 1.6 machine and although I did see some errors, replication succeeded (the errors were related to checkpointing the target, and my 1.6 could have been messed up). All that to say that where I tested it, returning a 203 on R was accepted behavior by clients, just as returning a 202 on W is. By no means is that extensive, but it is at least indicative. So, I think both approaches, field and status-code, are possible for single-key requests (more on that in a second), and whether it’s status or field, I favor at least having consistency with W. We could also have consistency by converting W’s 202 into an in-document meta field like _w_met, present only when ?is_w_met=true is on the query string. That feels more drastic.
>> This approach of calling all views out of scope, because they could only even be in scope under certain circumstances, still leaves the door open for either a status-code or a field (and again, if using a field it would be more consistent API behavior to switch W to behave the same).
>>
>> Cheers,
>> </JamesM>
>>
>> On Apr 2, 2015, at 3:51, Robert Samuel Newson <[email protected]> wrote:
>>
>>> To move this along I have COUCHDB-2655 and three branches with a working solution;
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
>>> https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691
>>>
>>> All three branches are called 2655-r-met if you want to try this locally (and please do!)
>>>
>>> Sample output;
>>>
>>> curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'
>>>
>>> {"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}
>>>
>>> By making it opt-in, I think we avoid all the collateral damage that Paul was concerned about.
>>>
>>> B.
>>>
>>>> On 2 Apr 2015, at 10:36, Robert Samuel Newson <[email protected]> wrote:
>>>>
>>>> Yeah, not a bad idea. An extra query arg (akin to open_revs=all, conflicts=true, etc) would avoid compatibility breaks and would clearly put the onus on those supplying it to tolerate the presence of the extra reserved field.
>>>>
>>>> +1
>>>>
>>>>> On 2 Apr 2015, at 10:32, Benjamin Bastian <[email protected]> wrote:
>>>>>
>>>>> What about adding an optional query parameter to indicate whether or not Couch should include the _r_met flag in the document body/bodies (defaulting to false)? That wouldn't break older clients, and it'd work for the bulk API as well.
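As a sketch of how a client might consume the opt-in field from the sample output above (the parsing helper is illustrative only, not part of any client library): since `_r_met` is injected into the body only when the client opts in, a client probably wants to strip it back out before reusing the document.

```python
import json

# The sample response body from the 2655-r-met branches.
SAMPLE = '{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}'

def split_r_met(body_json):
    """Split the opt-in _r_met indicator out of a document body.

    Returns (doc, r_met) where r_met is True/False when the server
    injected the field, and None when the client did not opt in.
    Stripping the reserved field keeps the doc safe to write back.
    """
    doc = json.loads(body_json)
    r_met = doc.pop("_r_met", None)
    return doc, r_met

doc, r_met = split_r_met(SAMPLE)   # r_met is True; doc has no _r_met key
```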
>>>>> As far as the case where there are conflicts, it seems like the most intuitive thing would be for the "r" in "_r_met" to have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait for r copies of the same doc rev until a timeout" and "_r_met" would mean "we got/didn't get r copies of the same doc rev within the timeout").
>>>>>
>>>>> Just my two cents.
>>>>>
>>>>> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <[email protected]> wrote:
>>>>>
>>>>>> Paul outlined his previous efforts to introduce this indication, and the problems he faced doing so. Can we come up with an acceptable mechanism?
>>>>>>
>>>>>> A different status code will break a lot of users. While the http spec says you can treat any 2xx code as success, plenty of libraries, etc, only recognise 201 / 202 as a successful write and 200 (and maybe 204, 206) for reads.
>>>>>>
>>>>>> My preference is for a change that "can’t" break anyone, which I think only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most pleasant thing.
>>>>>>
>>>>>> Suggestions?
>>>>>>
>>>>>> B.
>>>>>>
>>>>>>> On 1 Apr 2015, at 06:55, Mutton, James <[email protected]> wrote:
>>>>>>>
>>>>>>> For at least my part of it, I agree with Adam. Bigcouch has made an effort to inform in the case of a failure to apply W. I've seen it lead to confusion when the same logic was not applied on R.
>>>>>>>
>>>>>>> I also agree that W and R are not binding contracts. There's no agreement protocol to assure that W is met before being committed to disk. But they are exposed as a blocking parameter of the request, so consistent notification appeared to me to be the best compromise (vs straight-up removal).
>>>>>>> </JamesM>
>>>>>>>
>>>>>>>> On Mar 31, 2015, at 13:15, Robert Newson <[email protected]> wrote:
>>>>>>>>
>>>>>>>> If a way can be found that doesn't break things and that can be sent in all or most cases, sure. It's what a user can really infer from it that I focused on. Not as much, I think, as users who want that info really want.
>>>>>>>>
>>>>>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I hope we can all agree that CouchDB should inform the user when it is unable to satisfy the requested read "quorum".
>>>>>>>>>
>>>>>>>>> Adam
>>>>>>>>>
>>>>>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Sounds like there's a bit of confusion here.
>>>>>>>>>>
>>>>>>>>>> What Nathan is asking for is the ability to have Couch respond with some information on the actual number of replicas that responded to a read request. That way a user could tell that they issued an r=2 request when only r=1 was actually performed. Depending on your point of view in an MVCC world this is either a bug or a feature. :)
>>>>>>>>>>
>>>>>>>>>> It was generally agreed that if we could return this information it would be beneficial. Although what happened when I started implementing this patch was that we are either only able to return it in a subset of cases where it happens, return it inconsistently between various responses, or break replication.
>>>>>>>>>>
>>>>>>>>>> There were three general methods for this. The first would be to include a new "_r_met" key in the doc body, a boolean indicating whether the requested read quorum was actually met for the document.
>>>>>>>>>> The second was to return a custom X-R-Met type header, and lastly the status code as described.
>>>>>>>>>>
>>>>>>>>>> The _r_met member was thought to be the best, but unfortunately that breaks replication with older clients, because we throw an error rather than ignore any unknown underscore-prefixed field name. Thus having something that was just dynamically injected into the document body was a non-starter. Unfortunately, if we don't inject into the document body then we limit ourselves to only the set of APIs where a single document is returned. This is due to both streaming semantics (we can't buffer an entire response in memory for large requests to _all_docs) as well as multi-doc responses (a single boolean doesn't say which document may not have had a properly met R).
>>>>>>>>>>
>>>>>>>>>> On top of that, the other confusing part of meeting the read quorum is that, given MVCC semantics, it becomes a bit unclear how you respond to documents with different revision histories. For instance, if we read two docs, we have technically met the r=2 requirement, but what should our response be if those two revisions are different? (Technically, in this case we wait for the third response, but the decision on what to return for the "r met" value is still unclear.)
>>>>>>>>>>
>>>>>>>>>> While I think everyone is in agreement that it'd be nice to return some of the information about the copies read, I think it's much less clear what and how it should be returned in the multitude of cases where we can specify a value for R.
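The ambiguity Paul describes can be made concrete with a small sketch (a hypothetical helper, not CouchDB/fabric code): under the rev-matching semantics Benjamin proposed upthread, r is met only when some single revision has been returned by at least r replicas, so two responses with *different* revisions do not satisfy r=2.

```python
from collections import Counter

def r_met(revs_seen, r):
    """Decide an '_r_met' value under rev-matching semantics: r is met
    only when at least r replicas returned the *same* doc rev.
    `revs_seen` is the list of revision ids collected before the timeout.
    Illustrative only; not how fabric actually tallies responses."""
    counts = Counter(revs_seen)
    return any(n >= r for n in counts.values())

# Two replicas answered, so r=2 responses arrived, but with different
# revisions: no single revision reached the quorum.
r_met(["1-abc", "2-def"], 2)   # False under rev-matching semantics
r_met(["1-abc", "1-abc"], 2)   # True
```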
>>>>>>>>>> While that doesn't offer a concrete path forward, hopefully it clarifies some of the issues at hand.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> It’s testament to my friendship with Mike that we can disagree on such things and remain friends. I am sorry he misled you, though.
>>>>>>>>>>>
>>>>>>>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at least in the formal sense, the only one that matters; this is unfortunately sloppy language in too many places to correct.
>>>>>>>>>>>
>>>>>>>>>>> The r= and w= parameters control only how many of the n possible responses are collected before returning an http response.
>>>>>>>>>>>
>>>>>>>>>>> It’s not true that returning 202 in the situation where one write is made but fewer than 'r' writes are made means we’ve chosen availability over consistency, since even if we returned a 500 or closed the connection without responding, a subsequent GET could return the document (a probability that increases over time as anti-entropy creates the missing copies). A write attempt that returned a 409 could, likewise, introduce a new edit branch into the document, which might then 'win', altering the results of a subsequent GET.
>>>>>>>>>>>
>>>>>>>>>>> The essential thing to remember is this: the ’n’ copies of your data are completely independent when written/read by the clustered layer (fabric). It is internal replication (anti-entropy) that converges those copies, pair-wise, to the same eventual state. Fabric is converting the 3 independent results into a single result as best it can.
>>>>>>>>>>> Older versions did not expose the 201 vs 202 distinction, calling both of them 201. I do agree with you that there’s little value in the 202 distinction. About the only thing you could do is investigate your cluster for connectivity issues or overloading if you get a sustained period of 202’s, as it would be an indicator that the system is partitioned.
>>>>>>>>>>>
>>>>>>>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure that the result of a write did not change after the fact. That is, anti-entropy would need to be disabled, or somehow agree to roll forward or backward based on the initial circumstances. In short, we’d have to introduce strong consistency (paxos or raft or zab, say). While this would be a great feature to add, it’s not currently present, and no amount of twiddling the status codes will achieve it. We’d rather be honest about our position on the CAP triangle.
>>>>>>>>>>>
>>>>>>>>>>> B.
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I first hit it a few years ago. I found the original thread — this is the discussion I was trying to recall in my OP. It sounds like perhaps there is a related issue tracked internally at Cloudant as a result of that conversation.
>>>>>>>>>>>>
>>>>>>>>>>>> JamesM, thanks for your support here and for tracking this down. 203 seemed like the best status code to "steal" for this to me too. Best wishes in getting this fixed!
>>>>>>>>>>>> regards,
>>>>>>>>>>>> -natevw
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.0 is explicitly an AP system; the behaviour you describe is not classified as a bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anti-entropy is the main reason that you cannot get strong consistency from the system: it will transform "failed" writes (those that succeeded on one node but fewer than R nodes) into successes (N copies) as long as the nodes have enough healthy uptime.
>>>>>>>>>>>>>
>>>>>>>>>>>>> True of Cloudant and 2.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Funny you should mention it. I drafted an email in early February to queue up the same discussion whenever I could get involved again (which I promptly forgot about). What happens currently in 2.0 appears unchanged from earlier versions. When R is not satisfied in fabric, fabric_doc_open:handle_message eventually responds with a {stop, …} but leaves the acc state as the original r_not_met, which triggers a read_repair from the response handler. read_repair results in an {ok, …} with the only doc available, because no other docs are in the list. The final doc returned to chttpd_db:couch_doc_open, and thus to chttpd_db:db_doc_req, is simply {ok, Doc}, which has now lost the fact that the answer was not complete.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This seems straightforward to fix by a change in fabric_doc_open:handle_response and read_repair.
>>>>>>>>>>>>>> handle_response knows whether it has R met and could pass that forward, or allow read_repair to pass it forward if read_repair is able to satisfy acc.r. I can’t speak for community interest in the behavior of sending a 202, but it’s something I’d definitely like, for the same reasons you cite. Plus it just seems disconnected to do it on writes but not reads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> </JamesM>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry, I have not been following the CouchDB 2.0 roadmap, but I was extending my fermata-couchdb plugin today and realized that perhaps the Apache release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a serious issue I had using Cloudant's implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 for some additional background/explanation, but my understanding is that Cloudant for all practical purposes ignores the read durability parameter. So you can write with ?w=N to attempt some level of quorum, and get a 202 back if that quorum is unmet. _However_ when you ?r=N it really doesn't matter if only <N nodes are available… if even just a single available node has some version of the requested document you will get a successful response (!).
>>>>>>>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo features to dynamically _choose_ between consistency or availability — when it comes time to read back a consistent result, BigCouch instead just always gives you availability* regardless of what a given request actually needs. (In my usage I ended up treating a 202 write as a 500, rather than proceeding with no way of ever knowing whether a write did NOT ACTUALLY conflict or just hadn't YET because $who_knows_how_many nodes were still down…)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a Cloudant engineer (or support personnel at least) but could not be quickly fixed as it could introduce backwards-compatibility concerns. So…
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If true, could this read durability issue now be fixed during the merge?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -natevw
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of *any* Couch fork…
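The workaround natevw describes (treating a 202 write as a failure) amounts to a small client-side policy. A minimal sketch, assuming only the status codes discussed in this thread (201 = write with w met, 202 = write accepted but w unmet); the function name and `strict` flag are illustrative, not any library's API:

```python
def write_succeeded(status_code, strict=True):
    """Client-side policy for CouchDB/BigCouch write responses.

    In strict mode a 202 (write stored on at least one node, but the
    requested w was not met) is treated like a failure, as natevw did,
    so the caller can retry rather than proceed on a degraded write.
    """
    if status_code == 201:
        return True                 # write met the requested w
    if status_code == 202:
        return not strict           # degraded write: OK only if lenient
    return False                    # 4xx/5xx and anything else

write_succeeded(201)                # True
write_succeeded(202)                # False: quorum unmet, caller retries
write_succeeded(202, strict=False)  # True: availability preferred
```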
