Johs,

Good question. I think really it comes down to allowing the developer more 
options in how they access data, and how the database can help this be more 
efficient.

I don't think it reduces the usefulness of clustering. We still have three 
replicas of each shard, which allows different nodes in the cluster to respond 
to queries in the case of partitions and failures, which is the main benefit of 
clustering to my mind, while allowing developers to avoid some of the downsides 
if their data allows for it.

If a developer has a use case where data naturally cleaves into shards and 
queries are directed within the partitions created by that cleaving, it's 
wasteful to force them into the overhead of scatter-gather. This is about 
letting a developer help the database work faster for them, and also allow a 
greater throughput for a given cluster size if their data allows them to do 
this.

Clustered queries are currently expensive because:

1. All shards of a database are queried, which can be needless as some shards 
may contain no results.
2. Querying scales with shard count, which, while linear, isn't great for 
response times in larger databases.
3. Consolidating the result set before sending to the client can be a slow 
process, particularly if one node in the cluster is slow because for any sorted 
result set you have to wait for the next result from all shards before knowing 
which to send out (otherwise you can't guarantee correct ordering).

This proposal addresses both by allowing the customer to group data in a way 
which makes sense for (most) of their queries.

Now, the key issue is that a customer is more likely to cause themselves issues 
by not using a shard key with a good cardinality level, or one in which, like 
the IoT-by-date example, causes unbalanced writes. However, the benefits 
accrued from allowing single shard queries when a user has appropriately chosen 
shard keys are large.

Mike.

> On 24 Nov 2017, at 05:50, Johs Ensby <j...@b2w.com> wrote:
> 
> Hi Mike and Geoff,
> forgive me if I am asking a really stupid question, but
> wouldn't restricting certain data to specific shards defy the very concept 
> and core benefits of a clustered database?
> br
> Johs
> 
>> On 23 Nov 2017, at 22:49, Geoffrey Cox <redge...@gmail.com> wrote:
>> 
>> Ah, yeah, this makes sense to me. I think this has great potential!
>> 
>> On Thu, Nov 23, 2017 at 4:56 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
>> wrote:
>> 
>>> 
>>> 
>>>> On 22 Nov 2017, at 18:39, Geoffrey Cox <redge...@gmail.com> wrote:
>>>> 
>>>> Hi Mike, this sounds like a pretty cool enhancement. Just to clarify,
>>>> you're also proposing modifying the PUT/POST doc, etc... so that you can
>>>> specify a shard key per doc so that the doc can be stored on a specific
>>>> shard?
>>> 
>>> Yes, sort of. A document create request specifies a shard key as part of
>>> the document ID. The guarantee with respect to document placement then is:
>>> 
>>> "All documents with the same shard key are stored in the same shard".
>>> 
>>> By means of contrast, this *isn't* a way of saying "Put document on
>>> specific shard X". I don't find that ability very compelling for a user
>>> (why would they care that their doc was in range 000000000-abababab or
>>> whatever?), but introducing this grouping mechanism as a higher level
>>> abstraction on things meaningful within a data model I think does offer
>>> substantial benefit.
>>> 
>>> To elaborate on why this is useful a couple use-cases might help.
>>> 
>>> The first example is along the lines of using a user ID as a shard key.
>>> All documents for that user then end up on the same shard. A query can then
>>> be scoped by user ID (as its the shard key), which means that queries for a
>>> single user's data can be efficiently served from a single shard rather
>>> than asking all shards. This would significantly improve performance of an
>>> application from the point of view of that user.
>>> 
>>> Or, in an IoT use case, you might use the device ID as the shard key
>>> enabling fast retrieval of measurements from a single device.
>>> 
>>> It's important to note too that a shard may store documents from many
>>> different shard keys, so long as the above guarantee holds. In addition,
>>> the shard key needs to have high cardinality and to effectively spread
>>> requests over the shards.
>>> 
>>> An example that doesn't work is using the date as the shard key for the
>>> IoT case: while this has a high cardinality, at any given time, only a
>>> single shard will be in the write path.
>>> 
>>> Mike.
>>> 
>>> 
>>> 
> 

Reply via email to