[
https://issues.apache.org/jira/browse/COUCHDB-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181604#comment-14181604
]
Daniel Holth commented on COUCHDB-2310:
---------------------------------------
I've been able to do some additional work on this and am able to release it
under the Apache 2.0 license. It is at
https://github.com/dholth/pouchdb/compare/fast-enough
It contains the previously mentioned Lua shim that adds a _bulk_get API to
CouchDB, the necessary nginx reverse-proxy config, and initial support for the
feature in PouchDB. You POST an array of GET parameters as the JSON request
body to _bulk_get; the shim makes one GET subrequest per entry and returns the
concatenated results as a JSON array.
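As a sketch of those semantics (the helper names below are invented for illustration; the real shim is Lua running inside nginx):

```python
import json

def bulk_get(fetch_one, get_params):
    """Illustrative _bulk_get handler: take an array of GET parameter
    objects, perform one subrequest per entry, and concatenate the
    results into a single JSON array."""
    return json.dumps([fetch_one(params) for params in get_params])

# Stand-in for one GET /db/:docid?revs=true&open_revs=all subrequest.
def fake_subrequest(params):
    return {"_id": params["id"],
            "_rev": "1-abc",
            "_revisions": {"start": 1, "ids": ["abc"]}}

body = [{"id": "doc1"}, {"id": "doc2"}]
results = json.loads(bulk_get(fake_subrequest, body))
```

One POST replaces all of the per-document round-trips; the subrequests happen server-side where latency is negligible.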
I only chose CouchDB because I thought it could correctly replicate to PouchDB
in a reasonable amount of time. For a database of 1,639 documents, none of
them at generation 1, PouchDB's current replication algorithm makes 1,718
requests to the server. After the change, the same replication takes 98
requests. Now replication is fast enough to be useful.
I might have produced a patch to CouchDB itself, but I do not know Erlang, and
I already need to run CouchDB behind nginx for authentication reasons.
> Add a bulk API for revs & open_revs
> -----------------------------------
>
> Key: COUCHDB-2310
> URL: https://issues.apache.org/jira/browse/COUCHDB-2310
> Project: CouchDB
> Issue Type: Bug
> Security Level: public(Regular issues)
> Components: HTTP Interface
> Reporter: Nolan Lawson
>
> CouchDB replication is too slow.
> And what makes it so slow is that it's just so unnecessarily chatty. During
> replication, you have to do a separate GET for each individual document, in
> order to get the full {{_revisions}} object for that document (using the
> {{revs}} and {{open_revs}} parameters – refer to [the TouchDB
> writeup|https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm]
> or [Benoit's writeup|http://dataprotocols.org/couchdb-replication/] if you
> need a refresher).
> So for example, let's say you've got a database full of 10,000 documents, and
> you replicate using a batch size of 500 (batch sizes are configurable in
> PouchDB). The conversation for a single batch basically looks like this:
> {code}
> - REPLICATOR: gimme 500 changes since seq X (1 GET request)
> - SOURCE: okay
> - REPLICATOR: gimme the _revs_diff for these 500 docs/_revs (1 POST request)
> - SOURCE: okay
> - repeat 500 times:
>   - REPLICATOR: gimme the _revisions for doc n with _revs [...] (1 GET request)
>   - SOURCE: okay
> - REPLICATOR: here's a _bulk_docs with 500 documents (1 POST request)
> - TARGET: okay
> {code}
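> The chattiness is easy to quantify. A quick sketch of the per-batch arithmetic, using the example's numbers (an illustration, not a measurement):

```python
def requests_per_batch(batch_size):
    # 1 GET to _changes, 1 POST to _revs_diff,
    # batch_size GETs (one per document's _revisions),
    # and 1 POST to _bulk_docs on the target.
    return 1 + 1 + batch_size + 1

per_batch = requests_per_batch(500)   # 503 requests per batch
batches = 10_000 // 500               # 20 batches for the example DB
total = batches * per_batch           # 10,060 requests overall
```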
> See the problem here? That 500-loop, where we have to do a GET for each one
> of 500 documents, is a lot of unnecessary back-and-forth, considering that
> the replicator already knows what it needs before the loop starts. You can
> parallelize, but if you assume a browser (e.g. for PouchDB), most browsers
> only let you do ~8 simultaneous requests at once. Plus, there's latency and
> HTTP headers to consider. So overall, it's not cool.
> So why do we even need to do the separate requests? Shouldn't {{_all_docs}}
> be good enough? Turns out it's not, because we need this special
> {{_revisions}} object.
> For example, consider a document {{'foo'}} with 10 revisions. You may compact
> the database, in which case revisions {{1-x}} through {{9-x}} are no longer
> retrievable. However, if you query using {{revs}} and {{open_revs}}, those
> rev IDs are still available:
> {code}
> $ curl 'http://nolan.iriscouch.com/test/foo?revs=true&open_revs=all'
> {
>   "_id": "foo",
>   "_rev": "10-c78e199ad5e996b240c9d6482907088e",
>   "_revisions": {
>     "start": 10,
>     "ids": [
>       "c78e199ad5e996b240c9d6482907088e",
>       "f560283f1968a05046f0c38e468006bb",
>       "0091198554171c632c27c8342ddec5af",
>       "e0a023e2ea59db73f812ad773ea08b17",
>       "65d7f8b8206a244035edd9f252f206ad",
>       "069d1432a003c58bdd23f01ff80b718f",
>       "d21f26bb604b7fe9eba03ce4562cf37b",
>       "31d380f99a6e54875855e1c24469622d",
>       "3b4791360024426eadafe31542a2c34b",
>       "967a00dff5e02add41819138abb3284d"
>     ]
>   }
> }
> {code}
> And in the replication algorithm, _this full \_revisions object is required_
> at the point when you copy the document from one database to another, which
> is accomplished with a POST to {{_bulk_docs}} using {{new_edits=false}}. If
> you don't have the full {{_revisions}} object, CouchDB accepts the new
> revision, but considers it to be a conflict. (The exception is with
> generation-1 documents, since they have no history, so as it says in the
> TouchDB writeup, you can safely just use {{_all_docs}} as an optimization for
> such documents.)
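> Detecting that generation-1 case is trivial, since a CouchDB rev string encodes its generation before the dash. A small illustrative helper (not an API from either project):

```python
def rev_generation(rev):
    # CouchDB revs look like "10-c78e199a..."; the integer prefix
    # is the generation (revision depth).
    return int(rev.split("-", 1)[0])

def can_skip_revisions_fetch(rev):
    # Generation-1 documents have no ancestry, so _all_docs is enough;
    # anything deeper needs the full _revisions object.
    return rev_generation(rev) == 1

gen1 = can_skip_revisions_fetch("1-967a00dff5e02add41819138abb3284d")    # True
gen10 = can_skip_revisions_fetch("10-c78e199ad5e996b240c9d6482907088e")  # False
```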
> And unfortunately, this {{_revisions}} object is only available from the {{GET
> /:dbid/:docid}} endpoint. Trust me; I've tried the other APIs. You can't get
> it anywhere else.
> This is a huge problem, especially in PouchDB where we often have to deal
> with CORS, meaning the number of HTTP requests is doubled. So for those 500
> GETs, it's an extra 500 OPTIONs, which is just unacceptable.
> Replication does not have to be slow. While we were experimenting with ways
> of fetching documents in bulk, we tried a technique that just relied on using
> {{_changes}} with {{include_docs=true}}
> ([\#2472|https://github.com/pouchdb/pouchdb/pull/2472]). This pushed
> conflicts into the target database, but on the upside, you can sync ~95k
> documents from npm's skimdb repository to the browser in less than 20
> minutes! (See [npm-browser.com|http://npm-browser.com] for a demo.)
> What an amazing story we could tell about the beauty of CouchDB replication,
> if only this trick actually worked!
> My proposal is a simple one: just add the {{revs}} and {{open_revs}} options
> to {{_all_docs}}. Presumably this would be aligned with {{keys}}, so similar
> to how {{keys}} takes an array of docIds, {{open_revs}} would take an array
> of arrays of revisions. {{revs}} would just be a boolean.
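> To make that concrete, a request body under the proposal might look like this (the shape is my reading of the sketch above, not an existing CouchDB API):

```python
import json

# keys and open_revs are aligned by index: open_revs[i] lists the
# revisions wanted for keys[i], mirroring how `keys` works today.
proposed_body = {
    "keys": ["foo", "bar"],
    "open_revs": [
        ["10-c78e199ad5e996b240c9d6482907088e"],
        ["2-eec205a9d413992850a6e32678485900"],
    ],
    "revs": True,  # boolean: include the _revisions object per doc
}
payload = json.dumps(proposed_body)
```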
> This only gets hairy in the case of deleted documents. In this example,
> {{bar}} is deleted but {{foo}} is not:
> {code}
> curl -g
> 'http://nolan.iriscouch.com/test/_all_docs?keys=["bar","foo"]&include_docs=true'
> {"total_rows":1,"offset":0,"rows":[
> {"id":"bar","key":"bar","value":{"rev":"2-eec205a9d413992850a6e32678485900","deleted":true},"doc":null},
> {"id":"foo","key":"foo","value":{"rev":"10-c78e199ad5e996b240c9d6482907088e"},"doc":{"_id":"foo","_rev":"10-c78e199ad5e996b240c9d6482907088e"}}
> ]}
> {code}
> The cleanest approach would be to attach the {{_revisions}} object to the
> {{doc}}, but if you use {{keys}}, then deleted documents are returned with
> {{doc: null}}, even if you specify {{include_docs=true}}. One workaround
> would be to simply add a {{revisions}} object to the {{value}}.
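> Under that workaround, the deleted row for {{bar}} could carry its history even though {{doc}} stays null. A sketch of the proposed shape (the parent hash is a made-up placeholder; this is not current CouchDB output):

```python
# Hypothetical _all_docs row under the workaround: "doc" stays null
# for the deleted document, but "value" gains a "revisions" object.
row = {
    "id": "bar",
    "key": "bar",
    "value": {
        "rev": "2-eec205a9d413992850a6e32678485900",
        "deleted": True,
        "revisions": {
            "start": 2,
            # first entry is the current rev's hash; the second is a
            # placeholder standing in for the real parent hash
            "ids": ["eec205a9d413992850a6e32678485900",
                    "00000000000000000000000000000000"],
        },
    },
    "doc": None,
}
```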
> If all of this would be too difficult to implement under the hood in CouchDB,
> I'd also be happy to get the {{_revisions}} back in {{_changes}},
> {{_revs_diff}}, or even in a separate endpoint. I don't care, as long as
> there is some bulk API where I can get multiple {{_revisions}} for multiple
> documents at once.
> On the PouchDB end of things, we would really like to push forward on this.
> I'm happy to implement a Node.js proxy to stand in front of
> CouchDB/Cloudant/CSG and add this new API, plus adding it directly to PouchDB
> Server. I can invent whatever API I want, but the main thing is that I would
> like this API to be something that all the major players can agree upon
> (Apache, Cloudant, Couchbase) so that eventually the proxy is no longer
> necessary.
> Thanks for reading the WoT. Looking forward to a faster CouchDB replication
> protocol, since it's the thing that ties us all together and makes this crazy
> experiment worthwhile.
> Background: [this|https://github.com/pouchdb/pouchdb/issues/2686] and
> [this|https://gist.github.com/nolanlawson/340cb898f8ed9f3db8a0].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)