Johs,

Good question. I think it really comes down to giving the developer more options in how they access data, and in how the database can make that access more efficient.
I don't think it reduces the usefulness of clustering. We still have three replicas of each shard, which allows different nodes in the cluster to respond to queries in the case of partitions and failures (the main benefit of clustering to my mind), while allowing developers to avoid some of the downsides if their data allows for it.

If a developer has a use case where data naturally cleaves into shards, and queries are directed within the partitions created by that cleaving, it's wasteful to force them into the overhead of scatter-gather. This is about letting a developer help the database work faster for them, and it also allows greater throughput for a given cluster size if their data permits.

Clustered queries are currently expensive because:

1. All shards of a database are queried, which can be needless as some shards may contain no results.
2. Query cost scales with shard count, which, while linear, isn't great for response times in larger databases.
3. Consolidating the result set before sending it to the client can be a slow process, particularly if one node in the cluster is slow: for any sorted result set you have to wait for the next result from every shard before knowing which to send out (otherwise you can't guarantee correct ordering).

This proposal addresses these costs by allowing the customer to group data in a way which makes sense for most of their queries.

Now, the key issue is that a customer is more likely to cause themselves problems by choosing a shard key with poor cardinality, or one which, like the IoT-by-date example, causes unbalanced writes. However, the benefits accrued from allowing single-shard queries, when a user has chosen their shard keys appropriately, are large.

Mike.
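(As an aside, the placement guarantee is easy to sketch. The following is a minimal illustration, not CouchDB's actual hashing scheme: it assumes hypothetical doc IDs of the form "<shard_key>:<doc_id>" and a fixed shard count, and shows why routing on the shard key alone lets a key-scoped query hit a single shard instead of scatter-gathering across all of them.)

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count only


def shard_for(shard_key: str) -> int:
    """Map a shard key to a shard.

    Placement depends only on the shard key, not the full document ID,
    so every document sharing a key lands on the same shard.
    """
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Hypothetical doc IDs of the form "<shard_key>:<doc_id>".
doc_ids = [
    "user-17:profile",
    "user-17:order-1",
    "user-17:order-2",
    "user-42:profile",
]

placement = {d: shard_for(d.split(":", 1)[0]) for d in doc_ids}

# All of user-17's documents map to one shard, so a query scoped to
# shard key "user-17" is served by a single shard rather than all four.
user_17_shards = {s for d, s in placement.items() if d.startswith("user-17:")}
assert len(user_17_shards) == 1
```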
> On 24 Nov 2017, at 05:50, Johs Ensby <j...@b2w.com> wrote:
>
> Hi Mike and Geoff,
> forgive me if I am asking a really stupid question, but
> wouldn't restricting certain data to specific shards defy the very concept
> and core benefits of a clustered database?
> br
> Johs
>
>> On 23 Nov 2017, at 22:49, Geoffrey Cox <redge...@gmail.com> wrote:
>>
>> Ah, yeah, this makes sense to me. I think this has great potential!
>>
>> On Thu, Nov 23, 2017 at 4:56 AM Mike Rhodes <mrho...@linux.vnet.ibm.com> wrote:
>>
>>>> On 22 Nov 2017, at 18:39, Geoffrey Cox <redge...@gmail.com> wrote:
>>>>
>>>> Hi Mike, this sounds like a pretty cool enhancement. Just to clarify,
>>>> you're also proposing modifying the PUT/POST doc, etc... so that you can
>>>> specify a shard key per doc so that the doc can be stored on a specific
>>>> shard?
>>>
>>> Yes, sort of. A document create request specifies a shard key as part of
>>> the document ID. The guarantee with respect to document placement then is:
>>>
>>> "All documents with the same shard key are stored in the same shard".
>>>
>>> By means of contrast, this *isn't* a way of saying "Put document on
>>> specific shard X". I don't find that ability very compelling for a user
>>> (why would they care that their doc was in range 000000000-abababab or
>>> whatever?), but introducing this grouping mechanism as a higher level
>>> abstraction on things meaningful within a data model I think does offer
>>> substantial benefit.
>>>
>>> To elaborate on why this is useful, a couple of use-cases might help.
>>>
>>> The first example is along the lines of using a user ID as a shard key.
>>> All documents for that user then end up on the same shard. A query can then
>>> be scoped by user ID (as it's the shard key), which means that queries for a
>>> single user's data can be efficiently served from a single shard rather
>>> than asking all shards. This would significantly improve performance of an
>>> application from the point of view of that user.
>>>
>>> Or, in an IoT use case, you might use the device ID as the shard key,
>>> enabling fast retrieval of measurements from a single device.
>>>
>>> It's important to note too that a shard may store documents from many
>>> different shard keys, so long as the above guarantee holds. In addition,
>>> the shard key needs to have high cardinality and to effectively spread
>>> requests over the shards.
>>>
>>> An example that doesn't work is using the date as the shard key for the
>>> IoT case: while this has a high cardinality, at any given time, only a
>>> single shard will be in the write path.
>>>
>>> Mike.