Re: Could CouchDB 2.0 fix actual read quorum?

Robert Samuel Newson Tue, 31 Mar 2015 11:48:47 -0700

It’s testament to my friendship with Mike that we can disagree on such things 
and remain friends. I am sorry he misled you, though.

CouchDB 2.0 (like Cloudant) does not have read or write quorums at all, at 
least not in the formal sense, which is the only one that matters. The term is 
unfortunately used sloppily in too many places to correct.

The r= and w= parameters control only how many of the n possible responses are 
collected before returning an http response.
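
As a rough illustration of that point, here is a minimal Python sketch of "collect up to r replies, then answer". It is not fabric's actual code; the function name and the list-of-replies model are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def collect_responses(replies: List[Optional[str]], r: int) -> Tuple[List[str], bool]:
    """Gather replies in arrival order until r of them have been collected.

    `replies` has one entry per copy (n in total); None models a copy that
    never answered. Returns the collected replies and whether r was reached.
    Hypothetical helper: it only illustrates that r bounds how many replies
    are waited for, not what the copies themselves contain.
    """
    collected: List[str] = []
    for reply in replies:
        if reply is None:
            continue  # this copy did not respond
        collected.append(reply)
        if len(collected) >= r:
            return collected, True  # enough replies, stop waiting
    return collected, False  # ran out of copies before reaching r
```

Note that even when r is not reached, the caller still holds whatever replies did arrive, so it can still build an HTTP response from them.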

It’s not true that returning 202 in the situation where one write is made but 
fewer than 'w' writes are made means we’ve chosen availability over consistency. 
Even if we returned a 500 or closed the connection without responding, a 
subsequent GET could return the document (a probability that increases over 
time as anti-entropy creates the missing copies). A write attempt that returned 
a 409 could, likewise, introduce a new edit branch into the document, which 
might then 'win', altering the results of a subsequent GET.
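
A toy Python simulation of this scenario, under loud assumptions: each copy is a plain dict, the status codes are simplified, and `anti_entropy` only copies missing documents (no revision trees or conflict handling). None of these names come from CouchDB itself.

```python
from typing import Dict, List, Optional

Replica = Dict[str, str]  # doc id -> revision; toy stand-in for one copy


def write(replicas: List[Replica], up: List[bool], doc_id: str, rev: str, w: int) -> int:
    """Store the doc on every reachable copy and return a simplified status."""
    acks = 0
    for replica, reachable in zip(replicas, up):
        if reachable:
            replica[doc_id] = rev
            acks += 1
    if acks == 0:
        return 500
    return 201 if acks >= w else 202


def anti_entropy(replicas: List[Replica]) -> None:
    """Pair-wise sync: give each copy any doc that a peer has and it lacks."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for doc_id, rev in source.items():
                    target.setdefault(doc_id, rev)


def read(replicas: List[Replica], up: List[bool], doc_id: str) -> Optional[str]:
    """Return the doc from the first reachable copy that has it, if any."""
    for replica, reachable in zip(replicas, up):
        if reachable and doc_id in replica:
            return replica[doc_id]
    return None
```

With three empty copies, a write that reaches only the first one returns 202 under w=2. A read against the other two copies finds nothing at first, but finds the document once `anti_entropy` has run, whatever status the writer was given.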

The essential thing to remember is this: the ’n’ copies of your data are 
completely independent when written/read by the clustered layer (fabric). It is 
internal replication (anti-entropy) that converges those copies, pair-wise, to 
the same eventual state. Fabric is converting the 3 independent results into a 
single result as best it can. Older versions did not expose the 201 vs 202 
distinction, calling both of them 201. I do agree with you that there’s little 
value in the 202 distinction. About the only thing you could do is investigate 
your cluster for connectivity issues or overloading if you get a sustained 
period of 202’s, as it would be an indicator that the system is partitioned.
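
One way a client might watch for a sustained run of 202s is sketched below in Python. The class name, window size, and threshold are all hypothetical choices for illustration, not anything CouchDB provides.

```python
from collections import deque


class AcceptedRateMonitor:
    """Track recent write status codes and flag a sustained share of 202s."""

    def __init__(self, window: int = 100, threshold: float = 0.5) -> None:
        # Keep only the most recent `window` status codes.
        self.statuses = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status: int) -> None:
        self.statuses.append(status)

    def sustained_202(self) -> bool:
        # Only judge once a full window of writes has been observed.
        if len(self.statuses) < self.statuses.maxlen:
            return False
        accepted = sum(1 for s in self.statuses if s == 202)
        return accepted / len(self.statuses) >= self.threshold
```

A caller would `record()` the status of each write and check `sustained_202()` periodically; a True result is a hint to look at cluster connectivity or load, nothing more.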

In order to achieve your goals, CouchDB 2.0 would have to ensure that the 
result of a write did not change after the fact. That is, anti-entropy would 
need to be disabled, or somehow agree to roll forward or backward based on the 
initial circumstances. In short, we’d have to introduce strong consistency 
(paxos or raft or zab, say). While this would be a great feature to add, it’s 
not currently present, and no amount of twiddling the status codes will achieve 
it. We’d rather be honest about our position on the CAP triangle.

B.

> On 30 Mar 2015, at 22:37, Nathan Vander Wilt <[email protected]> wrote:
> 
> A technical co-founder of Cloudant agreed that this was a bug when I first 
> hit it a few years ago. I tracked down the original thread here; this is the 
> discussion I was trying to recall in my OP: 
> It sounds like perhaps there is a related issue tracked internally at 
> Cloudant as a result of that conversation.
> 
> JamesM, thanks for your support here and tracking this down. 203 seemed like 
> the best status code to "steal" for this to me too. Best wishes in getting 
> this fixed!
> 
> regards,
> -natevw
> 
> 
> On Mar 25, 2015, at 4:49 AM, Robert Newson <[email protected]> wrote:
> 
>> 2.0 is explicitly an AP system, the behaviour you describe is not classified 
>> as a bug. 
>> 
>> Anti-entropy is the main reason that you cannot get strong consistency from 
>> the system. It will transform "failed" writes (those that succeeded on one 
>> node but fewer than W nodes) into success (N copies) as long as the nodes 
>> have enough healthy uptime. 
>> 
>> True of cloudant and 2.0. 
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2015, at 15:14, Mutton, James <[email protected]> wrote:
>>> 
>>> Funny you should mention it.  I drafted an email in early February to queue 
>>> up the same discussion whenever I could get involved again (which I 
>>> promptly forgot about).  What happens currently in 2.0 appears unchanged 
>>> from earlier versions.  When R is not satisfied in fabric, 
>>> fabric_doc_open:handle_message eventually responds with a {stop, …}  but 
>>> leaves the acc-state as the original r_not_met which triggers a read_repair 
>>> from the response handler.  read_repair results in an {ok, …} with the only 
>>> doc available, because no other docs are in the list.  The final doc 
>>> returned to chttpd_db:couch_doc_open and thusly to chttpd_db:db_doc_req is 
>>> simply {ok, Doc}, which has now lost the fact that the answer was not 
>>> complete.
>>> 
>>> This seems straightforward to fix by a change in 
>>> fabric_doc_open:handle_response and read_repair.  handle_response knows 
>>> whether it has R met and could pass that forward, or allow read-repair to 
>>> pass it forward if read_repair is able to satisfy acc.r.  I can’t speak for 
>>> community interest in the behavior of sending a 202, but it’s something I’d 
>>> definitely like for the same reasons you cite.  Plus it just seems 
>>> disconnected to do it on writes but not reads.
>>> 
>>> Cheers,
>>> </JamesM>
>>> 
>>>> On Mar 24, 2015, at 14:06, Nathan Vander Wilt <[email protected]> 
>>>> wrote:
>>>> 
>>>> Sorry, I have not been following CouchDB 2.0 roadmap but I was extending 
>>>> my fermata-couchdb plugin today and realized that perhaps the Apache 
>>>> release of BigCouch as CouchDB 2.0 might provide an opportunity to fix a 
>>>> serious issue I had using Cloudant's implementation.
>>>> 
>>>> See https://github.com/cloudant/bigcouch/issues/55#issuecomment-30186518 
>>>> for some additional background/explanation, but my understanding is that 
>>>> Cloudant for all practical purposes ignores the read durability parameter. 
>>>> So you can write with ?w=N to attempt some level of quorum, and get a 202 
>>>> back if that quorum is unmet. _However_ when you ?r=N it really doesn't 
>>>> matter if only <N nodes are available…if even just a single available node 
>>>> has some version of the requested document you will get a successful 
>>>> response (!).
>>>> 
>>>> So in practice, there's no way to actually use the quasi-Dynamo features 
>>>> to dynamically _choose_ between consistency or availability — when it 
>>>> comes time to read back a consistent result, BigCouch instead just always 
>>>> gives you availability* regardless of what a given request actually needs. 
>>>> (In my usage I ended up treating a 202 write as a 500, rather than 
>>>> proceeding with no way of ever knowing whether a write did NOT ACTUALLY 
>>>> conflict or just hadn't YET because $who_knows_how_many nodes were still 
>>>> down…)
>>>> 
>>>> IIRC, this was both confirmed and acknowledged as a serious bug by a 
>>>> Cloudant engineer (or support personnel at least) but could not be quickly 
>>>> fixed as it could introduce backwards-compatibility concerns. So…
>>>> 
>>>> Is CouchDB 2.0 already breaking backwards compatibility with BigCouch? If 
>>>> true, could this read durability issue now be fixed during the merge?
>>>> 
>>>> thanks,
>>>> -natevw
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> * DISCLAIMER: this statement has not been endorsed by actual uptime of 
>>>> *any* Couch fork…
>>> 
>
