> On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote:
> 
> Yes indeed Jan :) Thanks Mike for writing this up! A couple of comments on 
> the proposal:
> 
>>      • For databases where this is enabled, every document needs a shard key.
> 
> What would happen if this constraint were relaxed, and documents without a 
> “:” in their ID simply used the full ID as the shard key as is done now?

I think that practically it's not that awful. Documents without shard keys end 
up spread reasonably, albeit uncontrollably, across shards.

But I think from a usability perspective, forcing this to be all or nothing for 
a database makes sense. It ensures that every document in the database behaves 
the same way, rather than having some documents behave one way and others 
another (i.e., some documents can be found via shard-local queries, whereas 
others are only visible at a global level).

I think that if people want documents to behave that differently, enforcing 
separate databases is helpful. It reinforces the point that these databases 
work well for use-cases where partitioning data by shard key makes sense, 
which is a different method of data modelling from having one huge 
undifferentiated pool. Perhaps there are heretofore unthought-of optimisations 
that only make sense if we can make this assumption too :)
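To make the all-or-nothing behaviour concrete, here is a minimal sketch of how routing might work under the proposal. This is purely illustrative: the helper names (`shard_key`, `shard_for`) are hypothetical, and it assumes the shard key is everything before the first ":" in the document ID and that only the shard key is hashed when choosing a shard.

```python
import hashlib

def shard_key(doc_id):
    """Hypothetical helper: extract the shard key from a document ID of
    the form 'key:rest'. Under the all-or-nothing rule, an ID without a
    ':' is rejected rather than falling back to the full ID."""
    key, sep, _ = doc_id.partition(":")
    if not sep:
        raise ValueError("document ID %r has no shard key" % doc_id)
    return key

def shard_for(doc_id, num_shards):
    """Route a document by hashing only its shard key, so every document
    sharing a key lands on the same shard (illustrative, not CouchDB's
    actual hashing scheme)."""
    digest = hashlib.md5(shard_key(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

With this scheme, "user1:order-1" and "user1:order-2" always map to the same shard, which is what makes shard-local queries possible.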

> 
>>      • Query results are restricted to documents with the shard key 
>> specified. Which makes things harder but leaves the door open for future 
>> things like shard-splitting without changing result sets. And it seems like 
>> what one would expect!
> 
> I agree this is important. It took me a minute to remember the point here, 
> which is that a query specifying a shard key needs to filter out results from 
> different shard keys that happen to be colocated on the same shard.
> 
> Does the current query functionality still work as it did before in a 
> database without shard keys? That is, can I still issue a query without 
> specifying a shard key and have it collate a response from the full dataset? 
> I think this is worth addressing explicitly. My assumption is that it does, 
> although I’m worried that there may be a problematic interaction if one tried 
> to use the same physical index to satisfy both a “global” query and a query 
> specifying a shard key.

I think this is an interesting question.

To start with, the basic point is that to use an index efficiently you'd want 
to prefix the index's columns with the shard key -- at least that's what I've 
been assuming, which likely means cleverer options are available :)

My first thought is that the naive approach to filtering out documents that 
don't match the shard key is just that -- naive: a node hosting a replica of a 
shard queries an index as normal, and some extra code then filters rows based 
on the document ID. Not actually super-awful -- we don't have to read the 
document itself, for example -- but for any use-case where many shard keys are 
associated with a given shard, it feels like one can do better. Still, as long 
as the node querying the index does the filtering, it should be pretty fast.
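The naive post-filter described above can be sketched in a few lines. Here `rows` stands in for what a shard-local index scan returns, as (key, doc_id) pairs; the function name and row shape are assumptions for illustration only.

```python
def filter_rows(rows, wanted_key):
    """Naive shard-local post-filter: keep only index rows whose doc ID
    carries the requested shard key. No document reads are needed --
    the shard key is recoverable from the doc ID alone."""
    prefix = wanted_key + ":"
    return [(key, doc_id) for key, doc_id in rows
            if doc_id.startswith(prefix)]
```

The cost is proportional to the number of rows scanned, which is why it feels wasteful when many shard keys share a shard: most scanned rows are discarded.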

I would wonder whether some more generally useful work on Mango could help 
reduce the amount of special case code going on:

- Push index selection down to each shard.
- Allow Mango to use multiple indexes to satisfy a query (even if this is 
simply for AND relationships).

Then for any database with the shard-key bit set, the shards also create a 
JSON index based on the shard key, and we can append an `AND shardkey=foo` to 
the user's Mango selector. As our shard keys are in the doc ID, I don't think 
this is any faster at all. It would be if the shard key were more complicated, 
say a field in the doc, so we didn't have it to hand all the time. But it would 
certainly make the alteration for the shard-local path much more contained and 
have very wide utility beyond this case.
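The selector rewrite itself would be tiny. A sketch, assuming the shard key is exposed as a hypothetical `shardkey` field that the extra JSON index covers (the field name is an assumption, not part of the proposal):

```python
def with_shard_key(selector, key):
    """Wrap a user's Mango selector in an $and that pins the assumed
    'shardkey' field, leaving the original selector untouched. $and and
    $eq are standard Mango operators."""
    return {"$and": [selector, {"shardkey": {"$eq": key}}]}
```

Wrapping rather than mutating the user's selector keeps the shard-local path contained, as suggested above: the rest of Mango sees an ordinary query.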

For views, I'm less sure there's anything smart you can do that doesn't add 
tonnes of overhead -- like making two indexes per view, one that's prefixed 
with the shard key and one that is not. This approach has all sorts of nasty 
interactions with things like `descending=true`, I imagine, however.
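To illustrate the two-indexes-per-view idea (and its cost): each emitted view row would produce two index entries, a global one keyed on the view key alone and a shard-local one prefixed with the shard key. The function below is a hypothetical sketch, not CouchDB code; it assumes the ID-prefix shard key from the proposal.

```python
def index_entries(doc_id, emitted_key):
    """Sketch of 'two indexes per view': one emit yields a global entry
    and a shard-key-prefixed local entry, doubling index writes and
    storage for every view row."""
    key = doc_id.split(":", 1)[0]
    return {
        "global": (emitted_key, doc_id),
        "local": ((key, emitted_key), doc_id),
    }
```

The nasty interaction hinted at above is visible in the local entry's compound key: a descending scan reverses the ordering of the shard-key prefix as well as the view key, so range queries need careful handling.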

Mike.
