Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments on the 
proposal:

>       • For databases where this is enabled, every document needs a shard key.

What would happen if this constraint were relaxed, and documents without a “:” 
in their ID simply used the full ID as the shard key as is done now?
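
To make the question concrete, here is a minimal Python sketch of the relaxed rule I have in mind. It is not CouchDB code: `shard_key_for` is a hypothetical helper, and treating an empty prefix (an ID starting with ":") as "no shard key" is my own assumption.

```python
def shard_key_for(doc_id: str) -> str:
    """Hypothetical helper: pick the string that would be hashed to choose a shard."""
    shard_key, sep, _doc_key = doc_id.partition(":")
    if sep and shard_key:
        # "<shardkey>:<dockey>": only the part before the first ":" is used.
        return shard_key
    # No ":" (or an empty prefix): fall back to the full ID, as happens today.
    return doc_id


print(shard_key_for("user42:order-1"))
print(shard_key_for("a:b:c"))
print(shard_key_for("plain-doc-id"))
```

Under this rule a database could mix both kinds of document, at the cost of the shard-key-scoped query endpoints never returning the documents that have no ":" in their ID.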

>       • Query results are restricted to documents with the shard key 
> specified. Which makes things harder but leaves the door open for future 
> things like shard-splitting without changing result sets. And it seems like 
> what one would expect!

I agree this is important. It took me a minute to remember the point here, 
which is that a query specifying a shard key needs to filter out results from 
different shard keys that happen to be colocated on the same shard.
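
A rough Python sketch of why the filter is needed, under stated assumptions: `shard_for` and `filter_to_shard_key` are hypothetical names, crc32-modulo-Q is only a stand-in for whatever the database really does to place documents, and the rows are made-up view rows.

```python
import zlib

Q = 2  # shard count, kept tiny so that collisions are guaranteed below


def shard_for(shard_key: str, q: int = Q) -> int:
    """Map a shard key to a shard number (crc32 is only a stand-in hash)."""
    return zlib.crc32(shard_key.encode("utf-8")) % q


def filter_to_shard_key(rows, shard_key):
    """Keep only rows whose document ID belongs to the requested shard key."""
    prefix = shard_key + ":"
    return [row for row in rows if row["id"].startswith(prefix)]


# With three shard keys and two shards, at least two keys must share a shard.
keys = ["alice", "bob", "carol"]
placement = {key: shard_for(key) for key in keys}
print(placement)

# Rows as one shard might return them, mixing colocated shard keys.
rows = [{"id": "alice:1"}, {"id": "bob:1"}, {"id": "alice:2"}]
print(filter_to_shard_key(rows, "alice"))
```

Without the filter step, the result set for a given shard key would depend on which other keys happen to share its shard, and so would change after a shard split.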

Does the current query functionality still work as it did before in a database 
without shard keys? That is, can I still issue a query without specifying a 
shard key and have it collate a response from the full dataset? I think this is 
worth addressing explicitly. My assumption is that it does, although I’m 
worried that there may be a problematic interaction if one tried to use the 
same physical index to satisfy both a “global” query and a query specifying a 
shard key.

Cheers, Adam

> On Nov 24, 2017, at 11:45 AM, Jan Lehnardt <j...@apache.org> wrote:
> 
> Hi Mike,
> 
> this is a great proposal, thanks a lot! I like this general idea, I guess 
> this was what Adam was hinting at in the _access draft review ;)
> 
> I left one comment inline.
> 
>> On 22. Nov 2017, at 14:56, Mike Rhodes <mrho...@linux.vnet.ibm.com> wrote:
>> 
>> All,
>> 
>> Just to be clear before starting: this is a proposal from Cloudant not just 
>> myself :D
>> 
>> ## What?
>> 
>> Introduce a user-specified shard key per document and a way for the user to 
>> scope queries to a single shard using this key, thereby reducing query 
>> latency by allowing the database to consult only the shard required to 
>> answer the user's query. Documents can share a shard key and all documents 
>> with a given shard key are written to the same shard. The shard key 
>> therefore overrides using the document ID as the key to shard by.
>> 
>> ## What problem does this proposal address?
>> 
>> To improve the performance and throughput of queries generally and, 
>> specifically, to significantly reduce the effect of data size on query 
>> latency and cluster load, giving the database better data-scaling 
>> characteristics.
>> 
>> CouchDB's current queries -- whether to views, search indexes or mango -- 
>> perform badly when the shard count (Q) for a database is high. This is 
>> because the coordinator node can't know in advance which shards hold results 
>> for the query, so must make a scatter-gather request to all shards, even 
>> those that in the end return no results. In turn this generates more load on 
>> all servers in the cluster because of many possibly redundant file reads.
>> 
>> We did some hacky performance testing of this idea by modifying the current 
>> view API to allow one to specify a specific view shard in the query. The 
>> results were very promising, as they showed that queries started scaling 
>> with similar characteristics to lookups, as you'd hope for, in that Q ceased 
>> having such a large effect and load on the cluster decreased.
>> 
>> ## API
>> 
>> There are a few general considerations which act as constraints on the API. 
>> To get these out in the open:
>> 
>> • The combined <shardkey>:<dockey> must be a valid document id. This 
>> enforces uniqueness of shardkey:docid across the database and invariance 
>> between document updates.
>> • Aside from calculation of shard location, we treat the combined 
>> <shardkey>:<dockey> as we do any other document _id.
>> • All existing tooling should continue to work as-is (e.g., changes feed 
>> doesn't have to change to specify a shardkey:dockey pair and the replicator 
>> will "just work").
>> • For Cloudant, it's useful to be able to differentiate queries by path for 
>> our internal accounting.
>> 
>> For this we have a documentid becoming a composite thing: 
>> `<shardkey>:<documentkey>`
>> 
>> • shardkey can be any valid characters for a document ID, but must not start 
>> with an _ or contain a : or /.
>> • documentkey can be any valid characters for a document ID. Further : 
>> characters are treated as part of the key and have no special meaning.
>> 
>> From this, our proposed API is as follows, which should also help clarify 
>> a bit how things work:
>> 
>> • Enable shard keys using database create: `PUT /db?use_shard_key=true`. 
>> Default false.
> 
> Since this is the first “different kind of database”, we should think 
> about how to do this more generally, as I expect more of these options to 
> come soon as well, _access comes to mind. Would this then be e.g.: `PUT 
> /db?use_shard_keys=true&per_doc_access=true`? It might make sense to put a 
> little bit of thought into how to name these options and how to 
> enable/disable them. This is a total bikeshed (although one CouchDB is famous 
> for) and won’t block the overall proposal: there will be a way to 
> enable/disable this, and how exactly is TBD.
> 
>>      • For databases where this is enabled, every document needs a shard key.
>> • Upload a document: `PUT /db/<shardkey>:<dockey>`
>>      • In these databases, documents need user-specified document IDs as the 
>> shard key needs to be set by the user.
>>      • Therefore reject POSTing documents to /db for simplicity (rather than 
>> having validation of an _id field).
>> • Other things that create documents will need the validation of document ID 
>> too:
>>      • Copy a document. Same as now but reject when destination document id 
>> does not have a shard key.
>>      • Calls like `POST /db/_bulk_docs`, `POST 
>> /db/_design/mydoc/_update/myupdatefunction` and others that accept documents 
>> via POST bodies will need to inspect the body for shard keys.
>> • Get a document: `GET /db/<shardkey>:<dockey>`.
>> • Special documents:
>>      • Design documents and local documents have no shardkey in their docID. 
>> This is enforced.
>> • Reject documents with IDs of the form `<shardkey>:_design/foo`.
>> • Query requests introduce a shardkey into the path, following an explicit 
>> `_shardkey` path part:
>>      • Mango: `POST /db/_shardkey/<shardkey>/_find`.
>>      • Views
>>              • `GET 
>> /db/_shardkey/<shardkey>/_design/mydoc/_view/myview?include_docs=true`
>>              • `POST 
>> /db/_shardkey/<shardkey>/_design/mydoc/_view/myview?include_docs=true`
>>      • Query results are restricted to documents with the shard key 
>> specified. Which makes things harder but leaves the door open for future 
>> things like shard-splitting without changing result sets. And it seems like 
>> what one would expect!
>> • The above is a white list of APIs, so a few notes on some cases I've left 
>> out with reasons:
>>      • _all_docs (can see uses, but would rather keep API small to start 
>> with). Note that Mango depends on the internal _all_docs API.
>>      • _changes (use-cases better served by explicit shard API).
>> 
>> ## What are we NOT addressing here?
>> 
>> This proposal only provides indirect shard addressing -- it's specifically 
>> not covering things discussed previously [1], such as per-shard changes 
>> feeds where one needs to directly address each shard. The shard key only 
>> controls the 
>> bucketing of documents into shards and doesn't say anything about, for 
>> example, whether two shard keys end up on the same shard -- that's an 
>> internal detail.
>> 
>> I'd really appreciate thoughts and comments before Cloudant get started on 
>> implementation of this.
>> 
>> Mike.
>> 
>> [1]: https://issues.apache.org/jira/browse/COUCHDB-2791
>> 
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/ 
