Thanks for the clarification, Mike. I figured that having multiple shard keys per doc would be a big change, but I was hoping I was wrong ;). I still think your proposed solution will add a lot of value to CouchDB. Unfortunately, it isn't the silver bullet that makes it feasible to step away from db-per-user in many cases: in the end you need shards/partitions that correspond to your users' queries, or else every query has to visit every shard, which doesn't scale well for a very large database with many nodes.
In my mind, either the database has to "replicate" the data to the right shards/partitions or you have to do this manually.

On Wed, Feb 7, 2018 at 2:43 AM Mike Rhodes <mrho...@linux.vnet.ibm.com> wrote:

> Geoff,
>
> Apologies for taking ages to reply.
>
> You can only have a single shard key per document, because the shard key
> directly affects where the document ends up being written to disk and,
> modulo shard replicas, a document is only in one place within the primary
> data shards.
>
> I think what you are thinking about is a major change to global view
> queries (i.e., what CouchDB has now) which affects how indexes are
> sharded. Instead of an index shard being essentially the index for a
> given primary data shard, you shard the index based on the keys emitted
> in the view map function. This enables global view queries to be directed
> to a single shard based on the keys involved.
>
> This is a massive change, and probably affects a bunch of the assumptions
> built into the way queries are executed by the clustering layer, as well
> as the actual mechanics of indexing primary data on one node but writing
> index data to another node (as the shard key for a document is no longer
> the same as the shard key for the indexed data, which it implicitly is in
> the current clustering model).
>
> This proposal avoids this kettle of fish, but it is a really nice way to
> speed up queries in a simpler global model. I think it ends up being
> something that works well with my proposal but which is pretty orthogonal
> in terms of implementation.
>
> Mike.
>
> > On 23 Jan 2018, at 20:27, Adam Kocoloski <kocol...@apache.org> wrote:
> >
> > The way I understand the proposal, you could satisfy at most one of
> > those requests (probably the *username* one) with a local query. The
> > other one would have to be a global query, but the proposal does allow
> > for a mix of local and global queries against the same dataset.
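To make the placement point concrete, here is a minimal sketch of how a shard key in the doc ID might decide where a document lives. This is not CouchDB's actual code; the `shardkey:docid` ID convention, the MD5 hash, and the 8-shard layout are assumptions for illustration only.

```python
# Sketch only -- not CouchDB's actual implementation. Illustrates Mike's
# point that the shard key alone decides which shard holds a document.
# Assumes a hypothetical "shardkey:docid" ID convention and 8 shards.
import hashlib

NUM_SHARDS = 8

def shard_for(doc_id: str) -> int:
    # Hash only the shard-key prefix, so all docs sharing a key co-locate;
    # IDs without a ":" fall back to hashing the whole ID, as today.
    key = doc_id.split(":", 1)[0] if ":" in doc_id else doc_id
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```

Because only the prefix is hashed, `alice:doc1` and `alice:doc2` land on the same shard, which is what makes a shard-local query possible.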
> >
> > Adam
> >
> >> On Jan 22, 2018, at 9:27 AM, Geoffrey Cox <redge...@gmail.com> wrote:
> >>
> >> Hey Mike,
> >>
> >> I've been thinking more about your proposal above, and when it is
> >> combined with the new access-per-db enhancement it should greatly
> >> reduce the need for db-per-user. One thing that I'm left wondering,
> >> though, is whether there is consideration for different shard keys
> >> per doc. From what I gather in your notes above, each doc would only
> >> have a single shard key, and I think implementing this alone will take
> >> significant work. However, if there was a way to have multiple shard
> >> keys per doc then you could avoid having duplicated data.
> >>
> >> For example, assume a database of student work:
> >>
> >> 1. Each doc has a `username` that corresponds with the owner of the doc
> >> 2. Each doc has a `classId` that corresponds with the class for which
> >>    the assignment was submitted
> >>
> >> Ideally, you'd be able to issue a query with a shard key specific to
> >> the `username` to get a student's work, and yet another query with a
> >> shard key specific to the `classId` to get the work from a teacher's
> >> perspective. Would your proposal allow for something like this?
> >>
> >> If not, I think you'd have to do something like duplicate the data,
> >> e.g. add another doc that has the username of the teacher so that you
> >> could query from the teacher's perspective. This of course could get
> >> pretty messy when you consider more complicated scenarios, as you
> >> could easily end up with a lot of duplicated data.
> >>
> >> Thanks!
> >>
> >> Geoff
> >>
> >> On Tue, Nov 28, 2017 at 5:35 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
> >> wrote:
> >>
> >>>
> >>>> On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote:
> >>>>
> >>>> Yes indeed Jan :) Thanks Mike for writing this up!
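The duplication Geoff describes can be sketched as writing the same assignment twice, once under each access path. The field names and the `shardkey:docid` ID convention are illustrative, not a real schema.

```python
# Sketch of the workaround when only one shard key per doc is possible:
# store the assignment twice, once keyed by the student's username and
# once keyed by the classId, so both query paths stay shard-local.
# Field names and the "shardkey:docid" ID convention are illustrative.
def duplicate_for_access_paths(assignment: dict) -> list[dict]:
    by_student = dict(assignment,
                      _id=f"{assignment['username']}:{assignment['assignmentId']}")
    by_class = dict(assignment,
                    _id=f"{assignment['classId']}:{assignment['assignmentId']}")
    return [by_student, by_class]
```

The obvious cost is that every update now has to touch both copies, which is exactly the messiness Geoff is worried about.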
> >>>> A couple of comments on the proposal:
> >>>>
> >>>>> • For databases where this is enabled, every document needs a
> >>>>>   shard key.
> >>>>
> >>>> What would happen if this constraint were relaxed, and documents
> >>>> without a ":" in their ID simply used the full ID as the shard key
> >>>> as is done now?
> >>>
> >>> I think that practically it's not that awful. Documents without shard
> >>> keys end up spread reasonably, albeit uncontrollably, across shards.
> >>>
> >>> But I think from a usability perspective, forcing this to be all or
> >>> nothing for a database makes sense. It makes sure that every document
> >>> in the database behaves the same way rather than having a bunch of
> >>> stuff that behaves one way and a bunch of stuff that behaves a
> >>> different way (i.e., you can find some documents via shard-local
> >>> queries, whereas others are only visible at a global level).
> >>>
> >>> I think that if people want documents to behave that differently,
> >>> enforcing different databases is helpful. It reinforces the point
> >>> that these databases work well for use-cases where partitioning data
> >>> using the shard key makes sense, which is a different method of data
> >>> modelling than having one huge undifferentiated pool. Perhaps there
> >>> are heretofore unthought-of optimisations that only make sense if we
> >>> can make this assumption too :)
> >>>
> >>>>> • Query results are restricted to documents with the shard key
> >>>>>   specified. Which makes things harder but leaves the door open for
> >>>>>   future things like shard-splitting without changing result sets.
> >>>>>   And it seems like what one would expect!
> >>>>
> >>>> I agree this is important. It took me a minute to remember the point
> >>>> here, which is that a query specifying a shard key needs to filter
> >>>> out results from different shard keys that happen to be colocated on
> >>>> the same shard.
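Adam's filtering point can be sketched as a post-filter over index rows on a shard: many shard keys share one physical shard, so rows belonging to other keys must be dropped. The row shape and the doc-ID-prefix convention are assumptions for illustration.

```python
# Sketch of the filtering Adam mentions: a shard hosts docs for many
# shard keys, so a shard-key query must drop colocated rows that belong
# to other keys. The row shape ({"id": ..., "key": ...}) is illustrative,
# as is the convention that the shard key is the doc-ID prefix before ":".
def filter_rows_by_shard_key(rows: list, shard_key: str) -> list:
    prefix = shard_key + ":"
    return [row for row in rows if row["id"].startswith(prefix)]
```

Note this only needs the doc ID already present in each index row, which matches Mike's observation below that the document itself never has to be read.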
> >>>>
> >>>> Does the current query functionality still work as it did before in
> >>>> a database without shard keys? That is, can I still issue a query
> >>>> without specifying a shard key and have it collate a response from
> >>>> the full dataset? I think this is worth addressing explicitly. My
> >>>> assumption is that it does, although I'm worried that there may be a
> >>>> problematic interaction if one tried to use the same physical index
> >>>> to satisfy both a "global" query and a query specifying a shard key.
> >>>
> >>> I think this is an interesting question.
> >>>
> >>> To start with, I guess the basic thing is that to efficiently use an
> >>> index you'd imagine that you'd prefix the index's columns with the
> >>> shard key -- at least that's the thing I've been thinking, which
> >>> likely means cleverer options are available :)
> >>>
> >>> My first thought is that the naive approach to filtering documents
> >>> not matching a shard key is just that -- a node hosting a replica of
> >>> a shard does a query on an index as normal and then there's some
> >>> extra code that filters based on ID. Not actually super-awful -- we
> >>> don't have to actually read the document itself, for example -- but
> >>> for any use-case where there are many shard keys associated with a
> >>> given shard it feels like one can do better. But as long as the node
> >>> querying the index is doing the filtering, it feels pretty fast.
> >>>
> >>> I would wonder whether some more generally useful work on Mango could
> >>> help reduce the amount of special-case code going on:
> >>>
> >>> - Push index selection down to each shard.
> >>> - Allow Mango to use multiple indexes to satisfy a query (even if
> >>>   this is simply for AND relationships).
> >>>
> >>> Then for any database with the shard key bit set true, the shards
> >>> also create a JSON index based on the shard key, and we can append an
> >>> `AND shardkey=foo` to the users' Mango selector.
> >>> As our shard keys are in the doc ID, I don't think this is any
> >>> faster at all. It would be if the shard key was more complicated,
> >>> say a field in the doc, so we didn't have it to hand all the time.
> >>> But it would certainly make the alteration for the shard-local path
> >>> much more contained, and it would have very wide utility beyond this
> >>> case.
> >>>
> >>> For views, I'm less sure there's anything smart you can do that
> >>> doesn't add tonnes of overhead -- like making two indexes per view,
> >>> one that's prefixed with the shard key and one which is not. This
> >>> approach has all sorts of nasty interactions with things like
> >>> reverse=true, I imagine, however.
> >>>
> >>> Mike.
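Mike's selector-rewriting idea can be sketched as wrapping the user's Mango selector in an implicit `$and` with a shard-key condition. The `_shard_key` field name is hypothetical; in the proposal the key actually lives in the doc ID rather than a document field.

```python
# Sketch of appending an implicit shard-key clause to a user's Mango
# selector, per Mike's suggestion. "$and"/"$eq" are standard Mango
# combination/condition operators; the "_shard_key" field is hypothetical.
def with_shard_key(selector: dict, shard_key: str) -> dict:
    return {"$and": [selector, {"_shard_key": {"$eq": shard_key}}]}
```

Wrapping in `$and` rather than merging keys keeps the rewrite valid for any user selector, including ones that already use `$or` or other combination operators at the top level.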