Geoff,

Apologies for taking ages to reply.

You can only have a single shard key per document, because the shard key 
directly affects where the document ends up being written to disk and, modulo 
shard replicas, a document is only in one place within the primary data shards.
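
For example (doc contents invented for illustration), these two docs
share the shard key "student-42" -- the part of the _id before the
":" -- and so are written to the same shard:

    { "_id": "student-42:essay-1", "classId": "math-101" }
    { "_id": "student-42:essay-2", "classId": "hist-200" }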

I think what you're describing is a major change to global view queries 
(i.e., what CouchDB has now) which affects how indexes are sharded. Instead 
of an index shard being essentially the index for a given primary data 
shard, you shard the index based on the keys emitted in the view map 
function. This enables global view queries to be directed to a single shard 
based on the keys involved.
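
For example, with an ordinary view like

    // a standard CouchDB map function, nothing new here
    function (doc) {
      emit(doc.classId, doc.title);
    }

all index rows keyed by classId "math-101" would be routed to the index
shard owning hash("math-101"), regardless of which primary data shard
holds each doc, so a query with ?key="math-101" could be answered by a
single index shard. (That's a sketch of the idea, not a concrete design.)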

This is a massive change; it probably affects a bunch of the assumptions 
built into the way the clustering layer executes queries, and it requires 
the actual mechanics of indexing primary data on one node but writing the 
index data to another node (as the shard key for a document is no longer 
the same as the shard key for its indexed data, which it implicitly is in 
the current clustering model).

My proposal avoids that kettle of fish, but sharding indexes by emitted key 
is a really nice way to speed up queries in the simpler global model. I 
think it ends up being something that works well with my proposal but which 
is pretty orthogonal in terms of implementation.

Mike.

> On 23 Jan 2018, at 20:27, Adam Kocoloski <kocol...@apache.org> wrote:
> 
> The way I understand the proposal, you could satisfy at most one of those 
> requests (probably the *username* one) with a local query. The other one 
> would have to be a global query, but the proposal does allow for a mix of 
> local and global queries against the same dataset.
> 
> Adam
> 
>> On Jan 22, 2018, at 9:27 AM, Geoffrey Cox <redge...@gmail.com> wrote:
>> 
>> Hey Mike,
>> 
>> I've been thinking more about your proposal above, and when it is
>> combined with the new access-per-db enhancement it should greatly reduce
>> the need for db-per-user. One thing I'm left wondering, though, is
>> whether there is any consideration for different shard keys per doc.
>> From what I gather in your notes above, each doc would have only a
>> single shard key, and I think implementing that alone will take
>> significant work. However, if there were a way to have multiple shard
>> keys per doc, then you could avoid duplicating data.
>> 
>> For example, assume a database of student work:
>> 
>>  1. Each doc has a `username` that corresponds with the owner of the doc
>>  2. Each doc has a `classId` that corresponds with the class for which
>>  the assignment was submitted
>> 
>> Ideally, you'd be able to issue a query with a shard key specific to the
>> `username` to get a student's work, and yet another query with a shard
>> key specific to the `classId` to get the work from a teacher's
>> perspective. Would your proposal allow for something like this?
>> 
>> If not, I think you'd have to do something like duplicate the data,
>> e.g. add another doc that has the username of the teacher so that you
>> could query from the teacher's perspective. This could, of course, get
>> pretty messy in more complicated scenarios, as you could easily end up
>> with a lot of duplicated data.
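>> 
>> To make the duplication concrete, you'd end up with something like this
>> (IDs invented), e.g. a second copy keyed by the class:
>> 
>>    { "_id": "student-42:assignment-7", "classId": "math-101" }
>>    { "_id": "math-101:assignment-7", "username": "student-42" }
>> 
>> i.e. the same assignment stored twice so that both access paths have a
>> matching shard key in the _id.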
>> 
>> Thanks!
>> 
>> Geoff
>> 
>> On Tue, Nov 28, 2017 at 5:35 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
>> wrote:
>> 
>>> 
>>>> On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote:
>>>> 
>>>> Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments
>>>> on the proposal:
>>>> 
>>>>>    • For databases where this is enabled, every document needs a
>>>>>      shard key.
>>>> 
>>>> What would happen if this constraint were relaxed, and documents without
>>>> a ":" in their ID simply used the full ID as the shard key as is done now?
>>> 
>>> I think that practically it's not that awful. Documents without shard keys
>>> end up spread reasonably, albeit uncontrollably, across shards.
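>>> 
>>> For example (a sketch; the doc is invented), with the constraint
>>> relaxed this doc:
>>> 
>>>    { "_id": "some-unkeyed-doc", "classId": "math-101" }
>>> 
>>> has no ":" in its _id, so the whole of "some-unkeyed-doc" would act as
>>> its shard key and the doc would hash wherever that lands.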
>>> 
>>> But I think from a usability perspective, forcing this to be all or
>>> nothing for a database makes sense. It makes sure that every document in
>>> the database behaves the same way rather than having a bunch of stuff that
>>> behaves one way and a bunch of stuff that behaves a different way (i.e.,
>>> you can find some documents via shard local queries, whereas others are
>>> only visible at a global level).
>>> 
>>> I think that if people want documents to behave that differently,
>>> enforcing different databases is helpful. It reinforces the point that
>>> these databases work well for use-cases where partitioning data using the
>>> shard key makes sense, which is a different method of data modelling than
>>> having one huge undifferentiated pool. Perhaps there are
>>> heretofore-unthought-of optimisations that only make sense if we can
>>> make this assumption, too :)
>>> 
>>>> 
>>>>>    • Query results are restricted to documents with the shard key
>>>>>      specified. Which makes things harder but leaves the door open for
>>>>>      future things like shard-splitting without changing result sets.
>>>>>      And it seems like what one would expect!
>>>> 
>>>> I agree this is important. It took me a minute to remember the point
>>>> here, which is that a query specifying a shard key needs to filter out
>>>> results from different shard keys that happen to be colocated on the
>>>> same shard.
>>>> 
>>>> Does the current query functionality still work as it did before in a
>>>> database without shard keys? That is, can I still issue a query without
>>>> specifying a shard key and have it collate a response from the full
>>>> dataset? I think this is worth addressing explicitly. My assumption is
>>>> that it does, although I’m worried that there may be a problematic
>>>> interaction if one tried to use the same physical index to satisfy both
>>>> a “global” query and a query specifying a shard key.
>>> 
>>> I think this is an interesting question.
>>> 
>>> To start with, I guess the basic thing is that to use an index
>>> efficiently you'd prefix the index's columns with the shard key -- at
>>> least that's what I've been thinking, which likely means cleverer
>>> options are available :)
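>>> 
>>> As a way to picture it (just a sketch, not the proposed mechanism),
>>> the equivalent view would emit the shard key as the leading element of
>>> a composite key:
>>> 
>>>    function (doc) {
>>>      // shard key is the portion of the _id before the ":"
>>>      var shardKey = doc._id.split(":")[0];
>>>      // leading shard key makes a shard-key-scoped query a
>>>      // contiguous range scan of the index
>>>      emit([shardKey, doc.classId], null);
>>>    }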
>>> 
>>> My first thought is that the naive approach to filtering documents not
>>> matching a shard key is just that -- a node hosting a replica of a shard
>>> does a query on an index as normal and then there's some extra code that
>>> filters based on ID. Not actually super-awful -- we don't have to actually
>>> read the document itself for example -- but for any use-case where there
>>> are many shard keys associated with a given shard it feels like one can do
>>> better. But as long as the node querying the index is doing it, it feels
>>> pretty fast.
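>>> 
>>> Something like this, for the post-filtering (a sketch; `rows` and
>>> `requestedShardKey` are placeholder names):
>>> 
>>>    // keep only index rows whose doc _id carries the shard key
>>>    // that the query asked for
>>>    rows.filter(function (row) {
>>>      return row.id.split(":")[0] === requestedShardKey;
>>>    });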
>>> 
>>> I would wonder whether some more generally useful work on Mango could help
>>> reduce the amount of special case code going on:
>>> 
>>> - Push index selection down to each shard.
>>> - Allow Mango to use multiple indexes to satisfy a query (even if this is
>>> simply for AND relationships).
>>> 
>>> Then for any database with the shard key bit set true, the shards also
>>> create a JSON index based on the shard key, and we can append an `AND
>>> shardkey=foo` to the users' Mango selector. As our shard keys are in the
>>> doc ID, I don't think this is any faster at all. It would be if the shard
>>> key was more complicated, say a field in the doc, so we didn't have it to
>>> hand all the time. But it would certainly make the alteration for the shard
>>> local path much more contained and have very wide utility beyond this case.
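>>> 
>>> In selector terms, something like this (a sketch -- "shardkey" is just
>>> the placeholder from above, not a real field):
>>> 
>>>    // the user's selector:
>>>    { "classId": "math-101" }
>>> 
>>>    // what the shard-local path would actually run:
>>>    { "$and": [ { "classId": "math-101" },
>>>                { "shardkey": "student-42" } ] }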
>>> 
>>> For views, I'm less sure there's anything smart you can do that doesn't
>>> add tonnes of overhead -- like keeping two indexes per view, one that's
>>> prefixed with the shard key and one that isn't. I imagine this approach
>>> has all sorts of nasty interactions with things like reverse=true,
>>> however.
>>> 
>>> Mike.
>>> 
> 
