Hey Mike, I've been thinking more about your proposal above, and combined with the new access-per-db enhancement it should greatly reduce the need for db-per-user. One thing I'm left wondering, though, is whether there has been any consideration of multiple shard keys per doc. From what I gather from your notes above, each doc would have only a single shard key, and I think implementing even that will take significant work. However, if there were a way to have multiple shard keys per doc, you could avoid having duplicated data.
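For concreteness, here's a rough sketch (my own, not from the proposal) of the routing that a single shard key per doc implies, assuming, per the discussion below, that the shard key is the portion of the doc ID before a `:`; the shard count and the md5-based placement are purely illustrative, not CouchDB's actual hashing:

```python
import hashlib

N_SHARDS = 8  # illustrative shard count, not CouchDB's actual q default

def shard_for(doc_id: str) -> int:
    """Pick a shard from the doc ID's shard key (the part before ':').

    All docs sharing a shard key land on the same shard, which is what
    makes shard-local queries possible; the hash itself is illustrative.
    """
    shard_key, sep, _ = doc_id.partition(":")
    if not sep:
        raise ValueError("every doc needs a shard key in this scheme")
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_SHARDS

# Docs sharing a shard key are colocated:
assert shard_for("alice:hw1") == shard_for("alice:hw2")
```

The key property is the colocation, not the particular hash: any stable function of the shard key alone gives shard-local queries something to work with.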
For example, assume a database of student work:

1. Each doc has a `username` that corresponds with the owner of the doc.
2. Each doc has a `classId` that corresponds with the class for which the assignment was submitted.

Ideally, you'd be able to issue a query with a shard key specific to the `username` to get a student's work, and another query with a shard key specific to the `classId` to get the work from a teacher's perspective. Would your proposal allow for something like this? If not, I think you'd have to do something like duplicate the data, e.g. add another doc that has the username of the teacher so that you could query from the teacher's perspective. This could of course get pretty messy in more complicated scenarios, as you could easily end up with a lot of duplicated data.

Thanks!
Geoff

On Tue, Nov 28, 2017 at 5:35 AM Mike Rhodes <mrho...@linux.vnet.ibm.com> wrote:
>
> > On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote:
> >
> > Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments on the proposal:
> >
> >> • For databases where this is enabled, every document needs a shard key.
> >
> > What would happen if this constraint were relaxed, and documents without a “:” in their ID simply used the full ID as the shard key as is done now?
>
> I think that practically it's not that awful. Documents without shard keys end up spread reasonably, albeit uncontrollably, across shards.
>
> But I think from a usability perspective, forcing this to be all or nothing for a database makes sense. It makes sure that every document in the database behaves the same way, rather than having a bunch of stuff that behaves one way and a bunch of stuff that behaves a different way (i.e., you can find some documents via shard-local queries, whereas others are only visible at a global level).
>
> I think that if people want documents to behave that differently, enforcing different databases is helpful.
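To make the duplication workaround in Geoff's student-work example concrete, here's a sketch (mine; the doc layout and field names are hypothetical) of writing each submission twice, once per queryable shard key:

```python
def submission_docs(username, class_id, assignment, body):
    """Docs needed so both the student and teacher views are shard-local.

    With one shard key per doc (the ID prefix before ':'), serving the
    classId view means re-keying a duplicate of every submission.
    """
    primary = {"_id": f"{username}:{class_id}:{assignment}", **body}
    duplicate = {"_id": f"{class_id}:{username}:{assignment}", **body}
    return [primary, duplicate]

docs = submission_docs("alice", "math101", "hw1", {"grade": "A"})
# docs[0] is queryable shard-locally by username, docs[1] by classId
```

Every submission now fans out to two docs that must be kept in sync, which is the mess being pointed at above.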
> It reinforces the point that these databases work well for use-cases where partitioning data using the shard key makes sense, which is a different method of data modelling than having one huge undifferentiated pool. Perhaps there are heretofore unthought-of optimisations that only make sense if we can make this assumption too :)
>
> >> • Query results are restricted to documents with the shard key specified. Which makes things harder but leaves the door open for future things like shard-splitting without changing result sets. And it seems like what one would expect!
> >
> > I agree this is important. It took me a minute to remember the point here, which is that a query specifying a shard key needs to filter out results from different shard keys that happen to be colocated on the same shard.
> >
> > Does the current query functionality still work as it did before in a database without shard keys? That is, can I still issue a query without specifying a shard key and have it collate a response from the full dataset? I think this is worth addressing explicitly. My assumption is that it does, although I’m worried that there may be a problematic interaction if one tried to use the same physical index to satisfy both a “global” query and a query specifying a shard key.
>
> I think this is an interesting question.
>
> To start with, I guess the basic thing is that to efficiently use an index you'd imagine that you'd prefix the index's columns with the shard key -- at least that's the thing I've been thinking, which likely means cleverer options are available :)
>
> My first thought is that the naive approach to filtering documents not matching a shard key is just that -- a node hosting a replica of a shard does a query on an index as normal, and then there's some extra code that filters based on ID.
> Not actually super-awful -- we don't have to actually read the document itself, for example -- but for any use-case where there are many shard keys associated with a given shard, it feels like one can do better. But as long as the node querying the index is doing it, it feels pretty fast.
>
> I would wonder whether some more generally useful work on Mango could help reduce the amount of special-case code going on:
>
> - Push index selection down to each shard.
> - Allow Mango to use multiple indexes to satisfy a query (even if this is simply for AND relationships).
>
> Then for any database with the shard key bit set true, the shards also create a JSON index based on the shard key, and we can append an `AND shardkey=foo` to the users' Mango selector. As our shard keys are in the doc ID, I don't think this is any faster at all. It would be if the shard key was more complicated, say a field in the doc, so we didn't have it to hand all the time. But it would certainly make the alteration for the shard-local path much more contained and have very wide utility beyond this case.
>
> For views, I'm less sure there's anything smart you can do that doesn't add tonnes of overhead -- like making two indexes per view, one that's prefixed with the shard key and one which is not. This approach has all sorts of nasty interactions with things like reverse=true, I imagine, however.
>
> Mike.
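A minimal sketch (my own, with a made-up index-row shape) of the naive shard-local filtering Mike describes above: the node hosting the shard drops index rows whose doc-ID prefix is a different shard key, without reading doc bodies, which is equivalent in spirit to appending `AND shardkey=foo` to the selector:

```python
def shard_key_of(doc_id):
    # Assumes the proposal's layout: shard key is the ID prefix before ':'
    return doc_id.partition(":")[0]

def shard_local_query(index_rows, shard_key):
    """index_rows: hypothetical (doc_id, key, value) tuples from one shard.

    The filter needs only the doc ID already present in each index row,
    so no document reads are required, as noted above.
    """
    return [row for row in index_rows if shard_key_of(row[0]) == shard_key]

# Rows colocated on one shard can belong to many shard keys:
rows = [
    ("alice:hw1", "2017-11-01", None),
    ("bob:hw1", "2017-11-02", None),
    ("alice:hw2", "2017-11-03", None),
]
```

The cost is a scan proportional to the rows examined, which is why prefixing the index's columns with the shard key, as suggested above, would let the node seek straight to the matching range instead.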