Re: Shard level querying in CouchDB Proposal

Adam Kocoloski Tue, 23 Jan 2018 12:28:09 -0800

The way I understand the proposal, you could satisfy at most one of those 
requests (probably the *username* one) with a local query. The other one would 
have to be a global query, but the proposal does allow for a mix of local and 
global queries against the same dataset.
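
To make the distinction concrete, here is a rough sketch in Python. It is not taken from the proposal or from the CouchDB codebase. It assumes doc IDs look like `username:rest`, that the shard key is the part of the ID before the first ":", and that a shard is chosen by hashing the shard key. `NUM_SHARDS`, `shard_for`, `local_query` and `global_query` are made-up names for illustration only.

```python
from zlib import crc32

NUM_SHARDS = 8  # hypothetical shard count, chosen only for this sketch


def shard_key(doc_id: str) -> str:
    """Part of the ID before the first ':' (the whole ID if there is none)."""
    return doc_id.split(":", 1)[0]


def shard_for(doc_id: str) -> int:
    """Assumed placement rule: hash the shard key, not the full doc ID."""
    return crc32(shard_key(doc_id).encode("utf-8")) % NUM_SHARDS


docs = [
    {"_id": "alice:hw1", "classId": "math101"},
    {"_id": "alice:hw2", "classId": "phys201"},
    {"_id": "bob:hw1", "classId": "math101"},
]


def local_query(all_docs, key):
    """Shard-local: visit one shard, then drop co-located docs with other keys."""
    target = crc32(key.encode("utf-8")) % NUM_SHARDS
    on_shard = [d for d in all_docs if shard_for(d["_id"]) == target]
    return [d for d in on_shard if shard_key(d["_id"]) == key]


def global_query(all_docs, class_id):
    """Global: classId is not the shard key, so every shard must be consulted."""
    return [d for d in all_docs if d.get("classId") == class_id]
```

With this layout, `local_query(docs, "alice")` only has to look at the one shard that holds alice's documents, while `global_query(docs, "math101")` has to gather results from all of them.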


Adam

> On Jan 22, 2018, at 9:27 AM, Geoffrey Cox <redge...@gmail.com> wrote:
> 
> Hey Mike,
> 
> I've been thinking more about your proposal above and when it is combined
> with the new access-per-db enhancement it should greatly reduce the need
> for db-per-user. One thing that I'm left wondering though is whether there
> is consideration for different shard keys per doc. From what I gather in
> your notes above, each doc would only have a single shard key and I think
> implementing this alone will take significant work. However, if there was a
> way to have multiple shard keys per doc then you could avoid having
> duplicated data.
> 
> For example, assume a database of student work:
> 
>   1. Each doc has a `username` that corresponds with the owner of the doc
>   2. Each doc has a `classId` that corresponds with the class for which
>   the assignment was submitted
> 
> Ideally, you'd be able to issue a query with a shard key specific to the
> `username` to get a student's work and yet another query with a shard key
> specific to the `classId` to get the work from a teacher's
> perspective. Would your proposal allow for something like this?
> 
> If not, I think you'd have to do something like duplicate the data, e.g.
> add another doc that has the username of the teacher so that you could
> query from the teacher's perspective. This of course could get pretty messy
> when you consider more complicated scenarios as you could easily end up
> with a lot of duplicated data.
> 
> Thanks!
> 
> Geoff
> 
> On Tue, Nov 28, 2017 at 5:35 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
> wrote:
> 
>> 
>>> On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote:
>>> 
>>> Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments
>>> on the proposal:
>>> 
>>>>     • For databases where this is enabled, every document needs a
>>>> shard key.
>>> 
>>> What would happen if this constraint were relaxed, and documents without
>>> a “:” in their ID simply used the full ID as the shard key as is done now?
>> 
>> I think that practically it's not that awful. Documents without shard keys
>> end up spread reasonably, albeit uncontrollably, across shards.
>> 
>> But I think from a usability perspective, forcing this to be all or
>> nothing for a database makes sense. It makes sure that every document in
>> the database behaves the same way rather than having a bunch of stuff that
>> behaves one way and a bunch of stuff that behaves a different way (i.e.,
>> you can find some documents via shard local queries, whereas others are
>> only visible at a global level).
>> 
>> I think that if people want documents to behave that differently,
>> enforcing different databases is helpful. It reinforces the point that
>> these databases work well for use-cases where partitioning data using the
>> shard key makes sense, which is a different method of data modelling than
>> having one huge undifferentiated pool. Perhaps there are heretofore
>> unthought of optimisations that only make sense if we can make this
>> assumption too :)
>> 
>>> 
>>>>     • Query results are restricted to documents with the shard key
>>>> specified. Which makes things harder but leaves the door open for future
>>>> things like shard-splitting without changing result sets. And it seems like
>>>> what one would expect!
>>> 
>>> I agree this is important. It took me a minute to remember the point
>>> here, which is that a query specifying a shard key needs to filter out
>>> results from different shard keys that happen to be colocated on the same
>>> shard.
>>> 
>>> Does the current query functionality still work as it did before in a
>>> database without shard keys? That is, can I still issue a query without
>>> specifying a shard key and have it collate a response from the full
>>> dataset? I think this is worth addressing explicitly. My assumption is that
>>> it does, although I’m worried that there may be a problematic interaction
>>> if one tried to use the same physical index to satisfy both a “global”
>>> query and a query specifying a shard key.
>> 
>> I think this is an interesting question.
>> 
>> To start with, I guess the basic thing is that to efficiently use an index
>> you'd imagine that you'd prefix the index's columns with the shard key --
>> at least that's the thing I've been thinking, which likely means cleverer
>> options are available :)
>> 
>> My first thought is that the naive approach to filtering documents not
>> matching a shard key is just that -- a node hosting a replica of a shard
>> does a query on an index as normal and then there's some extra code that
>> filters based on ID. Not actually super-awful -- we don't have to actually
>> read the document itself for example -- but for any use-case where there
>> are many shard keys associated with a given shard it feels like one can do
>> better. But as long as the node querying the index is doing it, it feels
>> pretty fast.
>> 
>> I would wonder whether some more generally useful work on Mango could help
>> reduce the amount of special case code going on:
>> 
>> - Push index selection down to each shard.
>> - Allow Mango to use multiple indexes to satisfy a query (even if this is
>> simply for AND relationships).
>> 
>> Then for any database with the shard key bit set true, the shards also
>> create a JSON index based on the shard key, and we can append an `AND
>> shardkey=foo` to the user's Mango selector. As our shard keys are in the
>> doc ID, I don't think this is any faster at all. It would be if the shard
>> key was more complicated, say a field in the doc, so we didn't have it to
>> hand all the time. But it would certainly make the alteration for the shard
>> local path much more contained and have very wide utility beyond this case.
>> 
>> For views, I'm less sure there's anything smart you can do that doesn't
>> add tonnes of overhead -- like making two indexes per view, one that's
>> prefixed with the shard key and one which is not. This approach has all
>> sorts of nasty interactions with things like descending=true I imagine,
>> however.
>> 
>> Mike.
>>
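
For reference, the two ideas in Mike's message above (filtering index rows by doc ID, and appending a shard key clause to a Mango selector) might look roughly like the Python sketch below. This is only an illustration under assumptions: rows are taken to be dicts with an `id` field as in view results, the shard key is assumed to be the doc ID prefix before ":", and the `shardkey` field name is borrowed from the `AND shardkey=foo` example rather than from any real index definition. Both function names are hypothetical.

```python
def rows_for_shard_key(rows, key):
    """Naive post-filter: keep index rows whose doc ID starts with '<key>:'.

    Only the row's ID is inspected, so the document body is never read.
    """
    prefix = key + ":"
    return [row for row in rows if row["id"].startswith(prefix)]


def with_shard_key(selector, key, field="shardkey"):
    """Wrap a user's Mango selector so it also requires the given shard key.

    `field` is a hypothetical indexed field holding the shard key.
    """
    return {"$and": [selector, {field: key}]}


# Rows from one shard, which may hold several shard keys side by side.
rows = [
    {"id": "alice:hw1", "key": "math101", "value": None},
    {"id": "alicia:hw1", "key": "math101", "value": None},
    {"id": "bob:hw1", "key": "math101", "value": None},
]

alice_rows = rows_for_shard_key(rows, "alice")
selector = with_shard_key({"classId": "math101"}, "foo")
```

Matching on `key + ":"` rather than on the bare key keeps a shard key such as "alice" from also matching IDs that belong to "alicia".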
