Re: Shard level querying in CouchDB Proposal

Geoffrey Cox Wed, 22 Nov 2017 10:40:03 -0800

Hi Mike, this sounds like a pretty cool enhancement. Just to clarify,
you're also proposing modifying the PUT/POST doc, etc... so that you can
specify a shard key per doc so that the doc can be stored on a specific
shard?


On Wed, Nov 22, 2017 at 5:57 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
wrote:

> All,
>
> Just to be clear before starting: this is a proposal from Cloudant not
> just myself :D
>
> ## What?
>
> Introduce a user-specified shard key per document and a way for the user
> to scope queries to a single shard using this key, thereby reducing query
> latency by allowing the database to consult only the shard required to
> answer the user's query. Documents can share a shard key and all documents
> with a given shard key are written to the same shard. The shard key
> therefore overrides using the document ID as the key to shard by.
>
> ## What problem does this proposal address?
>
> To improve performance and throughput of queries generally, and
> specifically to significantly decrease the effects of data size on query
> latency and cluster load leading to better data scaling characteristics of
> the database.
>
> CouchDB's current queries -- whether to views, search indexes or mango --
> perform badly when the shard count (Q) for a database is high. This is
> because the coordinator node can't know in advance which shards hold
> results for the query, so must make a scatter-gather request to all shards,
> even those that in the end return no results. In turn this generates more
> load on all servers in the cluster because of many possibly redundant file
> reads.
>
> We did some hacky performance testing of this idea by modifying the
> current view API to allow one to specify a specific view shard in the
> query. The results were very promising, as they showed that queries started
> scaling with similar characteristics to lookups, as you'd hope for, in that
> Q ceased having such a large effect and load on the cluster decreased.
>
> ## API
>
> There are a few general considerations which act as constraints on the
> API. To get these out in the open:
>
> • The combined <shardkey>:<dockey> must be a valid document id. This
> enforces uniqueness of shardkey:docid across the database and invariance
> between document updates.
> • Aside from calculation of shard location, we treat the combined
> <shardkey>:<dockey> as we do any other document _id.
> • All existing tooling should continue to work as-is (e.g., changes feed
> doesn't have to change to specify a shardkey:dockey pair and the replicator
> will "just work").
> • For Cloudant, it's useful to be able to differentiate queries by path
> for our internal accounting.
>
> For this we have a documentid becoming a composite thing:
> `<shardkey>:<documentkey>`
>
> • shardkey can be any valid characters for a document ID, but must not
> start with an _ or contain a : or /.
> • documentkey can be any valid characters for a document ID. Further :
> characters are treated as part of the key; meaningless.
>
> From this, our proposed API is, which should also help clarify a bit how
> things work:
>
> • Enable shard keys using database create: `PUT /db?use_shard_key=true`.
> Default false.
>         • For databases where this is enabled, every document needs a
> shard key.
> • Upload a document: `PUT /db/<shardkey>:<dockey>`
>         • In these databases, documents need user-specified document IDs
> as the shard key needs to be set by the user.
>         • Therefore reject POSTing documents to /db for simplicity (rather
> than having validation of an _id field).
> • Other things that create documents will need the validation of document
> ID too:
>         • Copy a document. Same as now but reject when destination
> document id does not have a shard key.
>         • Calls like `POST /db/_bulk_docs`, `POST
> /db/_design/mydoc/_update/myupdatefunction` and others that accept
> documents via POST bodies will need to inspect the body for shard keys.
> • Get a document: `GET /db/<shardkey>:<dockey>`.
> • Special documents:
>         • Design documents and local documents have no shardkey in their
> docID. This is enforced.
> • Reject documents with IDs of the form `<shardkey>:_design/foo`.
> • Query requests introduce a shardkey into the path, following an explicit
> `_shardkey` path part:
>         • Mango: `POST /db/_shardkey/<shardkey>/_find`.
>         • Views
>                 • `GET
> /db/_shardkey/<shardkey>/_design/mydoc/_view/myview?include_docs=true`
>                 • `POST
> /db/_shardkey/<shardkey>/_design/mydoc/_view/myview?include_docs=true`
>         • Query results are restricted to documents with the shard key
> specified. Which makes things harder but leaves the door open for future
> things like shard-splitting without changing result sets. And it seems like
> what one would expect!
> • The above is a white list of APIs, so a few notes on some cases I've
> left out with reasons:
>         • _all_docs (can see uses, but would rather keep API small to
> start with). Note that Mango depends on the internal _all_docs API.
>         • _changes (use-cases better served by explicit shard API).
>
> ## What are we NOT addressing here?
>
> This proposal only provides indirect shard addressing -- it's specifically
> not covering things discussed previously [1] per-shard changes feeds where
> one needs to directly address each shard. The shard key only controls the
> bucketing of documents into shards and doesn't say anything about, for
> example, whether two shard keys end up on the same shard -- that's an
> internal detail.
>
> I'd really appreciate thoughts and comments before Cloudant get started on
> implementation of this.
>
> Mike.
>
> [1]: https://issues.apache.org/jira/browse/COUCHDB-2791
>
>

Re: Shard level querying in CouchDB Proposal

Reply via email to