> On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote: > > Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments on > the proposal: > >> • For databases where this is enabled, every document needs a shard key. > > What would happen if this constraint were relaxed, and documents without a > “:” in their ID simply used the full ID as the shard key as is done now?
I think that practically it's not that awful. Documents without shard keys end up spread reasonably, albeit uncontrollably, across shards. But I think from a usability perspective, forcing this to be all or nothing for a database makes sense. It makes sure that every document in the database behaves the same way rather than having a bunch of stuff that behaves one way and a bunch of stuff that behaves a different way (i.e., you can find some documents via shard local queries, whereas others are only visible at a global level). I think that if people want documents to behave that differently, enforcing different databases is helpful. It reinforces the point that these databases work well for use-cases where partitioning data using the shard key makes sense, which is a different method of data modelling than having one huge undifferentiated pool. Perhaps there are heretofore unthought of optimisations that only make sense if we can make this assumption too :) > >> • Query results are restricted to documents with the shard key >> specified. Which makes things harder but leaves the door open for future >> things like shard-splitting without changing result sets. And it seems like >> what one would expect! > > I agree this is important. It took me a minute to remember the point here, > which is that a query specifying a shard key needs to filter out results from > different shard keys that happen to be colocated on the same shard. > > Does the current query functionality still work as it did before in a > database without shard keys? That is, can I still issue a query without > specifying a shard key and have it collate a response from the full dataset? > I think this is worth addressing explicitly. My assumption is that it does, > although I’m worried that there may be a problematic interaction if one tried > to use the same physical index to satisfy both a “global” query and a query > specifying a shard key. I think this is an interesting question. To start with, I guess the basic thing is that to efficiently use an index you'd imagine that you'd prefix the index's columns with the shard key -- at least that's the thing I've been thinking, which likely means cleverer options are available :) My first thought is that the naive approach to filtering documents not matching a shard key is just that -- a node hosting a replica of a shard does a query on an index as normal and then there's some extra code that filters based on ID. Not actually super-awful -- we don't have to actually read the document itself for example -- but for any use-case where there are many shard keys associated with a given shard it feels like one can do better. But as long as the node querying the index is doing it, it feels pretty fast. I would wonder whether some more generally useful work on Mango could help reduce the amount of special case code going on: - Push index selection down to each shard. - Allow Mango to use multiple indexes to satisfy a query (even if this is simply for AND relationships). Then for any database with the shard key bit set true, the shards also create a JSON index based on the shard key, and we can append an `AND shardkey=foo` to the users' Mango selector. As our shard keys are in the doc ID, I don't think this is any faster at all. It would be if the shard key was more complicated, say a field in the doc, so we didn't have it to hand all the time. But it would certainly make the alteration for the shard local path much more contained and have very wide utility beyond this case. For views, I'm less sure there's anything smart you can do that doesn't add tonnes of overhead -- like making two indexes per view, one that's prefixed with the shard key and one which is not. This approach has all sorts of nasty interactions with things like reverse=true I imagine, however. Mike.