Re: Could CouchDB 2.0 fix actual read quorum?

Robert Samuel Newson Thu, 02 Apr 2015 03:56:55 -0700

To move this along, I have COUCHDB-2655 and three branches with a working 
solution:


https://git-wip-us.apache.org/repos/asf?p=couchdb-chttpd.git;h=b408ce5
https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=7d811d3
https://git-wip-us.apache.org/repos/asf?p=couchdb-fabric.git;h=90e9691

All three branches are called 2655-r-met if you want to try this locally (and 
please do!)

Sample output:

curl -v 'foo:bar@localhost:15984/db1/doc1?is_r_met=true'

{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}

By making it opt-in, I think we avoid all the collateral damage that Paul was 
concerned about.
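
For anyone scripting against the branch, here is a minimal client-side sketch in Python of consuming the opt-in flag. It only parses a response body like the sample above. The helper name `r_met_from_body` is made up for illustration, and treating a missing `_r_met` as "unknown" (None) is an assumption about how a client would want to behave against a server without this patch.

```python
import json
from typing import Optional


def r_met_from_body(body: str) -> Optional[bool]:
    """Return the _r_met flag from a document response, or None if absent."""
    doc = json.loads(body)
    flag = doc.get("_r_met")
    # Only a real JSON boolean counts; anything else is treated as "unknown".
    return flag if isinstance(flag, bool) else None


# Body copied from the sample output above.
sample = '{"_id":"doc1","_rev":"1-967a00dff5e02add41819138abb3284d","_r_met":true}'
print(r_met_from_body(sample))  # prints True
```

A client that never sends is_r_met=true never sees the field, so nothing changes for it.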

B.


> On 2 Apr 2015, at 10:36, Robert Samuel Newson <[email protected]> wrote:
> 
> 
> Yeah, not a bad idea. An extra query arg (akin to open_revs=all, 
> conflicts=true, etc) would avoid compatibility breaks and would clearly put 
> the onus on those supplying it to tolerate the presence of the extra reserved 
> field.
> 
> +1
> 
> 
>> On 2 Apr 2015, at 10:32, Benjamin Bastian <[email protected]> wrote:
>> 
>> What about adding an optional query parameter to indicate whether or not
>> Couch should include the _r_met flag in the document body/bodies
>> (defaulting to false)? That wouldn't break older clients and it'd work for
>> the bulk API as well. As far as the case where there are conflicts, it
>> seems like the most intuitive thing would be for the "r" in "_r_met" to
>> have the same semantic meaning as the "r" in "?r=" (i.e. "?r=" means "wait
>> for r copies of the same doc rev until a timeout" and "_r_met" would mean
>> "we got/didn't get r copies of the same doc rev within the timeout").
>> 
>> Just my two cents.
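
A minimal sketch, in Python, of the semantics proposed above as I read them: the flag is true only when at least r of the replies gathered before the timeout carry the same revision. The function name `r_met` and the list-of-revision-strings input are assumptions for illustration; this is not how fabric tracks replies.

```python
from collections import Counter
from typing import Sequence


def r_met(replica_revs: Sequence[str], r: int) -> bool:
    """True if at least r replica replies carried the same revision."""
    if r <= 0:
        return True
    if not replica_revs:
        return False
    # Count how many replies agree on each revision; take the largest group.
    _, best = Counter(replica_revs).most_common(1)[0]
    return best >= r


# Two of three replies agree on revision 1-a, so r=2 is met.
print(r_met(["1-a", "1-a", "2-b"], 2))
# Two replies arrived but they disagree, so r=2 is not met.
print(r_met(["1-a", "2-b"], 2))
```

Under this reading, two replies with different revisions do not satisfy r=2, which answers the conflict case raised earlier in the thread.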
>> 
>> On Thu, Apr 2, 2015 at 1:22 AM, Robert Samuel Newson <[email protected]>
>> wrote:
>> 
>>> 
>>> Paul outlined his previous efforts to introduce this indication, and the
>>> problems he faced doing so. Can we come up with an acceptable mechanism?
>>> 
>>> A different status code will break a lot of users. While the http spec
>>> says you can treat any 2xx code as success, plenty of libraries, etc, only
>>> recognise 201 / 202 as successful write and 200 (and maybe 204, 206) for
>>> reads.
>>> 
>>> My preference is for a change that "can’t" break anyone, which I think
>>> only leaves an "X-CouchDB-R-Met: 2" response header, which isn’t the most
>>> pleasant thing.
>>> 
>>> Suggestions?
>>> 
>>> B.
>>> 
>>> 
>>>> On 1 Apr 2015, at 06:55, Mutton, James <[email protected]> wrote:
>>>> 
>>>> For at least my part of it, I agree with Adam. BigCouch has made an effort
>>>> to inform in the case of a failure to apply W. I've seen it lead to
>>>> confusion when the same logic was not applied on R.
>>>> 
>>>> I also agree that W and R are not binding contracts. There's no agreement
>>>> protocol to assure that W is met before being committed to disk. But they
>>>> are exposed as a blocking parameter of the request, so notification being
>>>> consistent appeared to me to be the best compromise (vs straight up
>>>> removal).
>>>> 
>>>> </JamesM>
>>>> 
>>>> 
>>>>> On Mar 31, 2015, at 13:15, Robert Newson <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> If a way can be found that doesn't break things that can be sent in all or
>>>>> most cases, sure. It's what a user can really infer from that which I
>>>>> focused on. Not as much, I think, as users that want that info really want.
>>>>> 
>>>>> 
>>>>>> On 31 Mar 2015, at 21:08, Adam Kocoloski <[email protected]> wrote:
>>>>>> 
>>>>>> I hope we can all agree that CouchDB should inform the user when it is
>>>>>> unable to satisfy the requested read "quorum".
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>>> On Mar 31, 2015, at 3:20 PM, Paul Davis <[email protected]> wrote:
>>>>>>> 
>>>>>>> Sounds like there's a bit of confusion here.
>>>>>>> 
>>>>>>> What Nathan is asking for is the ability to have Couch respond with some
>>>>>>> information on the actual number of replicas that responded to a read
>>>>>>> request. That way a user could tell that they issued an r=2 request when
>>>>>>> only r=1 was actually performed. Depending on your point of view in an
>>>>>>> MVCC world this is either a bug or a feature. :)
>>>>>>> 
>>>>>>> It was generally agreed upon that if we could return this information it
>>>>>>> would be beneficial. Although what happened when I started implementing
>>>>>>> this patch was that we are either only able to return it in a subset of
>>>>>>> cases where it happens, return it inconsistently between various
>>>>>>> responses, or break replication.
>>>>>>> 
>>>>>>> The three general methods for this would be to either include a new
>>>>>>> "_r_met" key in the doc body that would be a boolean indicating if the
>>>>>>> requested read quorum was actually met for the document. The second was
>>>>>>> to return a custom X-R-Met type header, and lastly was the status code as
>>>>>>> described.
>>>>>>> 
>>>>>>> The _r_met member was thought to be the best, but unfortunately that
>>>>>>> breaks replication with older clients because we throw an error rather
>>>>>>> than ignore any unknown underscore prefixed field name. Thus having
>>>>>>> something that was just dynamically injected into the document body was a
>>>>>>> non-starter. Unfortunately, if we don't inject into the document body
>>>>>>> then we limit ourselves to only the set of APIs where a single document
>>>>>>> is returned. This is due to both streaming semantics (we can't buffer an
>>>>>>> entire response in memory for large requests to _all_docs) as well as
>>>>>>> multi-doc responses (a single boolean doesn't say which document may have
>>>>>>> not had a properly met R).
>>>>>>> 
>>>>>>> On top of that, the other confusing part of meeting the read quorum is
>>>>>>> that given MVCC semantics it becomes a bit confusing on how you respond
>>>>>>> to documents with different revision histories. For instance, if we read
>>>>>>> two docs, we have technically made the r=2 requirement, but what should
>>>>>>> our response be if those two revisions are different (technically, in
>>>>>>> this case we wait for the third response, but the decision on what to
>>>>>>> return for the "r met" value is still unclear).
>>>>>>> 
>>>>>>> While I think everyone is in agreement that it'd be nice to return some
>>>>>>> of the information about the copies read, I think it's much less clear
>>>>>>> what and how it should be returned in the multitude of cases that we can
>>>>>>> specify a value for R.
>>>>>>> 
>>>>>>> While that doesn't offer a concrete path forward, hopefully it clarifies
>>>>>>> some of the issues at hand.
>>>>>>> 
>>>>>>> On Tue, Mar 31, 2015 at 1:47 PM, Robert Samuel Newson <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> It’s testament to my friendship with Mike that we can disagree on such
>>>>>>>> things and remain friends. I am sorry he misled you, though.
>>>>>>>> 
>>>>>>>> CouchDB 2.0 (like Cloudant) does not have read or write quorums at all,
>>>>>>>> at least in the formal sense, the only one that matters. This is
>>>>>>>> unfortunately sloppy language in too many places to correct.
>>>>>>>> 
>>>>>>>> The r= and w= parameters control only how many of the n possible
>>>>>>>> responses are collected before returning an http response.
>>>>>>>> 
>>>>>>>> It’s not true that returning 202 in the situation where one write is
>>>>>>>> made but fewer than 'w' writes are made means we’ve chosen availability
>>>>>>>> over consistency, since even if we returned a 500 or closed the
>>>>>>>> connection without responding, a subsequent GET could return the
>>>>>>>> document (a probability that increases over time as anti-entropy makes
>>>>>>>> the missing copies). A write attempt that returned a 409 could,
>>>>>>>> likewise, introduce a new edit branch into the document, which might
>>>>>>>> then 'win', altering the results of a subsequent GET.
>>>>>>>> 
>>>>>>>> The essential thing to remember is this: the ’n’ copies of your data are
>>>>>>>> completely independent when written/read by the clustered layer
>>>>>>>> (fabric). It is internal replication (anti-entropy) that converges those
>>>>>>>> copies, pair-wise, to the same eventual state. Fabric is converting the
>>>>>>>> 3 independent results into a single result as best it can. Older
>>>>>>>> versions did not expose the 201 vs 202 distinction, calling both of them
>>>>>>>> 201. I do agree with you that there’s little value in the 202
>>>>>>>> distinction. About the only thing you could do is investigate your
>>>>>>>> cluster for connectivity issues or overloading if you get a sustained
>>>>>>>> period of 202’s, as it would be an indicator that the system is
>>>>>>>> partitioned.
>>>>>>>> 
>>>>>>>> In order to achieve your goals, CouchDB 2.0 would have to ensure that
>>>>>>>> the result of a write did not change after the fact. That is,
>>>>>>>> anti-entropy would need to be disabled, or somehow agree to roll forward
>>>>>>>> or backward based on the initial circumstances. In short, we’d have to
>>>>>>>> introduce strong consistency (paxos or raft or zab, say). While this
>>>>>>>> would be a great feature to add, it’s not currently present, and no
>>>>>>>> amount of twiddling the status codes will achieve it. We’d rather be
>>>>>>>> honest about our position on the CAP triangle.
>>>>>>>> 
>>>>>>>> B.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> A technical co-founder of Cloudant agreed that this was a bug when I
>>>>>>>>> first hit it a few years ago. I tracked down the original thread here;
>>>>>>>>> this is the discussion I was trying to recall in my OP:
>>>>>>>>> It sounds like perhaps there is a related issue tracked internally at
>>>>>>>>> Cloudant as a result of that conversation.
>>>>>>>>> 
>>>>>>>>> JamesM, thanks for your support here and tracking this down. 203 seemed
>>>>>>>>> like the best status code to "steal" for this to me too. Best wishes in
>>>>>>>>> getting this fixed!
>>>>>>>>> 
>>>>>>>>> regards,
>>>>>>>>> -natevw
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Mar 25, 2015, at 4:49 AM, Robert Newson <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> 2.0 is explicitly an AP system; the behaviour you describe is not
>>>>>>>>>> classified as a bug.
>>>>>>>>>> 
>>>>>>>>>> Anti-entropy is the main reason that you cannot get strong consistency
>>>>>>>>>> from the system: it will transform "failed" writes (those that
>>>>>>>>>> succeeded on one node but fewer than W nodes) into success (N copies)
>>>>>>>>>> as long as the nodes have enough healthy uptime.
>>>>>>>>>> 
>>>>>>>>>> True of cloudant and 2.0.
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>> 
>>>>>>>>>>> On 24 Mar 2015, at 15:14, Mutton, James <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Funny you should mention it. I drafted an email in early February to
>>>>>>>>>>> queue up the same discussion whenever I could get involved again
>>>>>>>>>>> (which I promptly forgot about). What happens currently in 2.0
>>>>>>>>>>> appears unchanged from earlier versions. When R is not satisfied in
>>>>>>>>>>> fabric, fabric_doc_open:handle_message eventually responds with a
>>>>>>>>>>> {stop, …} but leaves the acc-state as the original r_not_met which
>>>>>>>>>>> triggers a read_repair from the response handler. read_repair
>>>>>>>>>>> results in an {ok, …} with the only doc available, because no other
>>>>>>>>>>> docs are in the list. The final doc returned to
>>>>>>>>>>> chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is simply
>>>>>>>>>>> {ok, Doc}, which has now lost the fact that the answer was not
>>>>>>>>>>> complete.
>>>>>>>>>>> 
>>>>>>>>>>> This seems straightforward to fix by a change in
>>>>>>>>>>> fabric_doc_open:handle_response and read_repair. handle_response
>>>>>>>>>>> knows whether it has R met and could pass that forward, or allow
>>>>>>>>>>> read_repair to pass it forward if read_repair is able to satisfy
>>>>>>>>>>> acc.r. I can’t speak for community interest in the behavior of
>>>>>>>>>>> sending a 202, but it’s something I’d definitely like for the same
>>>>>>>>>>> reasons you cite. Plus it just seems disconnected to do it on writes
>>>>>>>>>>> but not reads.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> </JamesM>
>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry, I have not been following the CouchDB 2.0 roadmap, but I was
>>>>>>>>>>>> extending my fermata-couchdb plugin today and realized that perhaps
>>>>>>>>>>>> the Apache release of BigCouch as CouchDB 2.0 might provide an
>>>>>>>>>>>> opportunity to fix a serious issue I had using Cloudant's
>>>>>>>>>>>> implementation.
>>>>>>>>>>>> 
>>>>>>>>>>>> See
>>>>>>>>>>>> https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518
>>>>>>>>>>>> for some additional background/explanation, but my understanding is
>>>>>>>>>>>> that Cloudant for all practical purposes ignores the read durability
>>>>>>>>>>>> parameter. So you can write with ?w=N to attempt some level of
>>>>>>>>>>>> quorum, and get a 202 back if that quorum is unmet. _However_ when
>>>>>>>>>>>> you ?r=N it really doesn't matter if only <N nodes are available…if
>>>>>>>>>>>> even just a single available node has some version of the requested
>>>>>>>>>>>> document you will get a successful response (!).
>>>>>>>>>>>> 
>>>>>>>>>>>> So in practice, there's no way to actually use the quasi-Dynamo
>>>>>>>>>>>> features to dynamically _choose_ between consistency or
>>>>>>>>>>>> availability: when it comes time to read back a consistent result,
>>>>>>>>>>>> BigCouch instead just always gives you availability* regardless of
>>>>>>>>>>>> what a given request actually needs. (In my usage I ended up
>>>>>>>>>>>> treating a 202 write as a 500, rather than proceeding with no way of
>>>>>>>>>>>> ever knowing whether a write did NOT ACTUALLY conflict or just
>>>>>>>>>>>> hadn't YET because $who_knows_how_many nodes were still down…)
>>>>>>>>>>>> 
>>>>>>>>>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a
>>>>>>>>>>>> Cloudant engineer (or support personnel at least) but could not be
>>>>>>>>>>>> quickly fixed as it could introduce backwards-compatibility
>>>>>>>>>>>> concerns. So…
>>>>>>>>>>>> 
>>>>>>>>>>>> Is CouchDB 2.0 already breaking backwards compatibility with
>>>>>>>>>>>> BigCouch? If true, could this read durability issue now be fixed
>>>>>>>>>>>> during the merge?
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> -natevw
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime
>>>>>>>>>>>> of *any* Couch fork…
>>>>>> 
>>> 
>>> 
>
