N=1 might reduce some of the issues, but it won't eliminate the problem entirely. The fundamental issue is that the "_dbs" db, which contains a document corresponding to every clustered database in the system, does not provide immediate consistency guarantees, so cycling databases (deleting and recreating them under the same name) can result in conflicts arising in these docs. The docs contain the shard/node mappings, and conflicts can cause different nodes to have different views of the world.
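As a rough illustration of the alternative to cycling (a sketch only, not code from CouchDB or PouchDB): a test harness can hand every test its own database name, so that no name is ever deleted and then recreated. The helper name unique_db_name and the "testdb" prefix are made up for this example; the regular expression reflects CouchDB's documented rules for database names.

```python
import re
import uuid

# CouchDB's documented rule for database names: a lowercase letter first,
# then lowercase letters, digits, or any of _ $ ( ) + - /
DB_NAME_RE = re.compile(r"^[a-z][a-z0-9_$()+/-]*$")


def unique_db_name(prefix: str = "testdb") -> str:
    """Return a practically unique database name (hypothetical helper)."""
    # uuid4().hex is 32 lowercase hex characters, so only the prefix can
    # make the resulting name illegal.
    name = f"{prefix}_{uuid.uuid4().hex}"
    if not DB_NAME_RE.match(name):
        raise ValueError(f"not a legal CouchDB database name: {name!r}")
    return name


if __name__ == "__main__":
    # Each test gets its own name, so no name is deleted and recreated.
    for _ in range(3):
        print(unique_db_name())
```

The trade-off is that old test databases pile up and have to be cleaned out later, in one pass, rather than between tests.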
It's important to remember that the "_dbs" db powers the db -> shards mapping and is a fundamental component of the quorum system. Because it operates at a lower level, the standard clustered quorum semantics are unfortunately not available in the "_dbs" db itself. You can see the initial synchronization during bootup in [1], which circles its way back to [2] by way of mem3_sync_nodes.erl. You can further see that the "_dbs" db is a local db from the way shards are loaded in [3] and from the fallback for creating the "_dbs" db in [4].

So in summary, the "_dbs" db operates at a lower level than the quorum system because it is a core component that powers the shard mappings. It therefore uses a different approach to synchronization: each node has a full copy of the "_dbs" db and syncs directly with the other nodes. This is a known weak point, as can be seen from the impact of cycling databases too quickly, and so the recommended best practice is to not cycle databases quickly. Obviously this is not ideal, and this is one of the areas where a CP config store of some sort would be a significant boon, but bolting a CP system onto an AP system is fraught with a new set of complexities.

(A clarification on N=1: with N=1 you have only one replica of the database, and the database exists on only one node. The rest of the nodes still need to get the updated "_dbs" db doc, because any node in the cluster can handle any request and so needs to know where the database exists. In general, you have one coordinating node and N replica nodes containing the N replicas (of each shard) for the given database. In a three-node cluster with N=3, whichever coordinating node handles the request will also have a local shard replica, but this is a special case. In a cluster with more than 3 nodes, say 15 nodes, the coordinating node has only a 3/15 chance of holding a local shard replica (assuming round-robin load balancing across nodes). So basically every node must know where every database exists, because every node can coordinate every request.)

-Russell

[1] https://github.com/apache/couchdb-mem3/blob/15615b295ec970ca9b12b7b54107a80b95149511/src/mem3_sync.erl#L234-L236
[2] https://github.com/apache/couchdb-mem3/blob/15615b295ec970ca9b12b7b54107a80b95149511/src/mem3_sync.erl#L230-L232
[3] https://github.com/apache/couchdb-mem3/blob/699308f510d335d05bfd0416ad5e893b68a7ec1d/src/mem3_shards.erl#L266-L283
[4] https://github.com/apache/couchdb-mem3/blob/699308f510d335d05bfd0416ad5e893b68a7ec1d/src/mem3_util.erl#L214-L222

On Fri, Sep 2, 2016 at 10:43 AM, Nolan Lawson <[email protected]> wrote:

> Thanks, Dale. That was my recollection as well.
>
> Basically PouchDB does PUT -> DELETE -> PUT between every test, so since
> there are 1000s of tests, this race condition comes up pretty easily. We
> can add a timeout or do a random DB name, but without doing that we don't
> know if Couch 2.x is truly "passing" the test suite or not.
>
> I have some time this weekend, so I'll look into adding a patch to do the
> workaround for Couch 2. I tend to side with Jan that in a clustered system
> it can't reliably tell us when a database was truly deleted without
> sacrificing the A in CAP. PouchDB users are already familiar with the weird
> ways that databases start to behave when you actually DELETE them (e.g.
> replication gets unreliable), hence workarounds like
> https://www.npmjs.com/package/pouchdb-erase . In practice I expect PouchDB
> users to never delete databases, so this is just an artifact of our test
> suite IMO.
>
> –Nolan
>
>
> On Fri, Sep 2, 2016 at 3:14 AM, Dale Harvey <[email protected]> wrote:
>
> > In PouchDB we can look into a workaround that uses random names only when
> > the tests are run against Couch 2.0, however I would really like to make
> > sure that a database not being fully deleted when we get a successful
> > confirmation of deletion is considered a bug, it has impacts beyond the
> > test suite, it's really hard to create a reliable system when there is no
> > way for you to be certain when a database is deleted.
> >
> > Will found it easiest to reproduce this using concurrent scripts but would
> > like to clarify that Pouch doesn't run the test suite in parallel, this bug
> > can be hit by doing CREATE -> DELETE -> CREATE, it's extremely hard to nail
> > down and reproduce (the similar bug in PouchDB took many attempts +
> > months). I will take a look at seeing if I can make easier and clearer
> > steps to reproduce.
> >
> > On 2 September 2016 at 11:01, Jan Lehnardt <[email protected]> wrote:
> >
> > >
> > > > On 02 Sep 2016, at 11:58, Will Holley <[email protected]> wrote:
> > > >
> > > > Jan - I can understand that being the case in a clustered setup with
> > > > distributed shard maps but shouldn't n=1 mitigate that?
> > >
> > > n=1 still does q=8 (8 shards per node) and the software makes
> > > no consistency guarantees whatsoever.
> > >
> > > n=1 && q=1 might work as a side-effect, but not sure how that is useful
> > > for reliable tests :)
> > >
> > > Best
> > > Jan
> > > --
> > >
> > >
> > > >
> > > > On 2 September 2016 at 10:53, Jan Lehnardt <[email protected]> wrote:
> > > >>
> > > >>> On 02 Sep 2016, at 11:45, Dale Harvey <[email protected]> wrote:
> > > >>>
> > > >>> In PouchDB we used to generate unique database names for tests,
> > > >>> however we removed it for several reasons, one large reason being it
> > > >>> indicates a race condition in critical code if we cannot reliably
> > > >>> create -> delete -> create the same database (we have uncovered and
> > > >>> fixed a lot of bugs in PouchDB due to this). While it's not my call
> > > >>> how to prioritise those bugs, I really do not think we should be
> > > >>> closing what are fairly serious bugs because it wasn't inconvenient
> > > >>> to workaround them in the couch test suite.
> > > >>
> > > >> It’s just that a CouchDB 2.0 cluster is an AP system, and recreating
> > > >> databases in quick succession reliably basically requires a CA system
> > > >> and that’s not what we can do easily.
> > > >>
> > > >> (I hope I got the CAP letters right, but I think it is clear what I
> > > >> mean)
> > > >>
> > > >> That is, maybe we skip those tests when run against a CouchDB 2.0
> > > >> endpoint and keep them for PouchDB?
> > > >>
> > > >> Best
> > > >> Jan
> > > >> --
> > > >>
> > > >>
> > > >>>
> > > >>> On 2 September 2016 at 10:31, Joan Touzet <[email protected]> wrote:
> > > >>>
> > > >>>> Hi Nolan, Will:
> > > >>>>
> > > >>>> A further update from looking deeper with @janl. It appears that we
> > > >>>> have a pending fix for COUCHDB-3017 and we'll work on getting that
> > > >>>> merged before 2.0.
> > > >>>>
> > > >>>> COUCHDB-3034 is a WONTFIX. FYI in CouchDB itself we changed all of
> > > >>>> our tests to use unique database names. I'll update the bug myself
> > > >>>> shortly.
> > > >>>>
> > > >>>> -Joan
> > > >>>>
> > > >>>> ----- Original Message -----
> > > >>>>> From: "Joan Touzet" <[email protected]>
> > > >>>>> To: [email protected]
> > > >>>>> Sent: Friday, September 2, 2016 5:15:00 AM
> > > >>>>> Subject: Re: Getting libraries to test RCs
> > > >>>>>
> > > >>>>> Hi Will,
> > > >>>>>
> > > >>>>> Neither of these are currently tagged as blocking issues for CouchDB
> > > >>>>> 2.0, only major priority. If you want to flag them as such, this is
> > > >>>>> your last chance, and even still, there's no guarantee fixes for them
> > > >>>>> will hit 2.0.
> > > >>>>>
> > > >>>>> Erlangers, is there any chance of at least triaging these today?
> > > >>>>>
> > > >>>>> -Joan
> > > >>>>>
> > > >>>>> ----- Original Message -----
> > > >>>>>> From: "Will Holley" <[email protected]>
> > > >>>>>> To: [email protected], "Joan Touzet" <[email protected]>
> > > >>>>>> Sent: Friday, September 2, 2016 4:43:48 AM
> > > >>>>>> Subject: Re: Getting libraries to test RCs
> > > >>>>>>
> > > >>>>>> Assuming nothing's changed in the last few weeks, there are 2 issues
> > > >>>>>> which cause the PouchDB tests to fail against master: COUCHDB-3017
> > > >>>>>> and COUCHDB-3034.
> > > >>>>>>
> > > >>>>>> Both could be addressed in the test suite by using different database
> > > >>>>>> names for each test, but that's quite a disruptive change.
> > > >>>>>>
> > > >>>>>> On 2 September 2016 at 03:15, Joan Touzet <[email protected]> wrote:
> > > >>>>>>> Hi Nolan, you state that it's 'failing for known reasons.' Is that
> > > >>>>>>> reasons in PouchDB or anything you need to push back on us? We'd
> > > >>>>>>> like to know ASAP as we're very, very close to releasing 2.0 now.
> > > >>>>>>>
> > > >>>>>>> I have zero PouchDB knowledge so I'm hoping you can give us a short
> > > >>>>>>> summary of what you think is wrong.
> > > >>>>>>>
> > > >>>>>>> All the best,
> > > >>>>>>> Joan
> > > >>>>>>>
> > > >>>>>>> ----- Original Message -----
> > > >>>>>>>> From: "Nolan Lawson" <[email protected]>
> > > >>>>>>>> To: [email protected]
> > > >>>>>>>> Sent: Thursday, September 1, 2016 7:56:42 PM
> > > >>>>>>>> Subject: Re: Getting libraries to test RCs
> > > >>>>>>>>
> > > >>>>>>>> We have been testing CouchDB master in PouchDB for months now, but
> > > >>>>>>>> as an allowed failure because I believe it’s failing for known
> > > >>>>>>>> reasons. We test both using Node.js and the browser.
> > > >>>>>>>>
> > > >>>>>>>> Node: https://travis-ci.org/pouchdb/pouchdb/jobs/156198210
> > > >>>>>>>> Browser: https://travis-ci.org/pouchdb/pouchdb/jobs/156198211
> > > >>>>>>>>
> > > >>>>>>>> For anyone who wants to run the Pouch test suite against CouchDB,
> > > >>>>>>>> it’s just:
> > > >>>>>>>>
> > > >>>>>>>> git clone https://github.com/pouchdb/pouchdb.git
> > > >>>>>>>> cd pouchdb
> > > >>>>>>>> npm i
> > > >>>>>>>> COUCH_HOST=http://localhost:5984 BAIL=0 npm t
> > > >>>>>>>>
> > > >>>>>>>> BAIL=0 will tell it to run the full test suite and not stop on any
> > > >>>>>>>> failures. That way you can inspect the failures and see if they’re
> > > >>>>>>>> serious or not.
> > > >>>>>>>>
> > > >>>>>>>> Cheers,
> > > >>>>>>>> Nolan
> > > >>>>>>>>
> > > >>>>>>>>> On Aug 29, 2016, at 12:15 PM, Jan Lehnardt <[email protected]>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Anyone on this list who could help with this? The work items are
> > > >>>>>>>>> fairly self-explanatory and not very big individually <3
> > > >>>>>>>>>
> > > >>>>>>>>> Best
> > > >>>>>>>>> Jan
> > > >>>>>>>>> --
> > > >>>>>>>>>
> > > >>>>>>>>>> On 10 Aug 2016, at 09:37, Jan Lehnardt <[email protected]>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hey everyone,
> > > >>>>>>>>>>
> > > >>>>>>>>>> from Joan’s excellent blog post about testing Release
> > > >>>>>>>>>> Candidates:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> To our valued CouchDB application and library developers:
> > > >>>>>>>>>>> please, please run your software against each of the options
> > > >>>>>>>>>>> below.
> > > >>>>>>>>>>
> > > >>>>>>>>>> - https://blog.couchdb.org/2016/08/08/release-candidates/
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think we can be a little more proactive about this for CouchDB
> > > >>>>>>>>>> client libraries: let’s open issues on all the
> > > >>>>>>>>>> CouchDB-compatible client software we care about to test an RC.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Since there are a lot of projects, and we don’t necessarily know
> > > >>>>>>>>>> which one we “care” about, we should try to be clever about it.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Maybe something like this can work:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1. We prepare an issue text explaining the thing: Heya, CouchDB
> > > >>>>>>>>>> team here, major new version coming up, you should test it like
> > > >>>>>>>>>> so: <include instructions to test against a 3-node cluster.
> > > >>>>>>>>>> Maybe even provide a cluster to do this, or Cloudant can sponsor
> > > >>>>>>>>>> something?
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. Post this message with a call to action on [email protected], the
> > > >>>>>>>>>> weekly news, and our other (social) media channels.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3. Ask people who submitted an issue to report back with a link.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 4. Collect the link in an issue or JIRA (this could be done in
> > > >>>>>>>>>> 3., but then everybody needs to be added to the wiki write
> > > >>>>>>>>>> group, and that’s just extra overhead we don’t need). Maybe we
> > > >>>>>>>>>> borrow a gist for this, or a Google doc.
> > > >>>>>>>>>>
> > > >>>>>>>>>> That way we encourage client software to check out RCs and we
> > > >>>>>>>>>> can keep track, while the community helps to select which
> > > >>>>>>>>>> software to encourage to test 2.0 compat, and helps spread the
> > > >>>>>>>>>> word and the burden is not left with just a few folks.
> > > >>>>>>>>>>
> > > >>>>>>>>>> What do you think?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best
> > > >>>>>>>>>> Jan
> > > >>>>>>>>>> --
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> Professional Support for Apache CouchDB:
> > > >>>>>>>>> https://neighbourhood.ie/couchdb-support/
>
> --
> Nolan Lawson
> nolanlawson.com
> github.com/nolanlawson
>
