[
https://issues.apache.org/jira/browse/CALCITE-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030142#comment-17030142
]
Julian Hyde edited comment on CALCITE-2141 at 2/4/20 11:20 PM:
---------------------------------------------------------------
Yes, what [~christian.beikov] suggests is in line with my plan.
I think you should stick to the simple case, at first, where each table has a
shard column. We can generalize later to situations where tables are sharded
based on ranges, modulos etc.
Calcite has a neat feature called stored generated columns (see CALCITE-1991).
If we want to handle complex sharding conditions, we could still introduce
{{shard}} as a stored generated column. The predicates on that column will make
optimization possible.
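As a sketch of the idea in the comment above (all names hypothetical, not Calcite API): if {{shard}} is a stored generated column defined as {{tenant_id % 30}}, then an equality predicate on {{tenant_id}} implies an equality predicate on {{shard}}, which is what makes worker pruning possible.

```java
// Hypothetical sketch: the generated column's defining expression lets us
// translate "tenant_id = v" into the implied "shard = v % 30".
// Class and method names are assumptions for illustration only.
import java.util.function.IntUnaryOperator;

public class ShardPredicate {
    // Defining expression of the stored generated column "shard".
    static final IntUnaryOperator SHARD_EXPR = tenantId -> tenantId % 30;

    // Given the literal v of a "tenant_id = v" predicate, derive the
    // implied "shard = ..." predicate's value.
    static int impliedShard(int tenantIdValue) {
        return SHARD_EXPR.applyAsInt(tenantIdValue);
    }

    public static void main(String[] args) {
        // select * from A where tenant_id = 31  implies  shard = 1
        System.out.println(impliedShard(31));
    }
}
```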

> Implement query rewrite based on sharding configuration
> -------------------------------------------------------
>
> Key: CALCITE-2141
> URL: https://issues.apache.org/jira/browse/CALCITE-2141
> Project: Calcite
> Issue Type: New Feature
> Reporter: Christian Beikov
> Priority: Minor
>
> Based on topology changes, it should be possible to dynamically update a
> sharding configuration for Calcite. The effect of such a configuration is
> that a query involving sharded tables is rewritten to a more optimal form,
> possibly targeting multiple different data sources.
> This is an interesting building block for distributed databases but also for
> applications since it enables the implementation of a static sharding scheme.
> Doing the shard rewriting on a client is also a lot better when using a
> distributed database as that eliminates the need for a coordinator node
> through which data is tunneled.
> Also see [https://github.com/shardingjdbc/sharding-jdbc] for an existing
> implementation.
> So imagine a topology with a master node and 2 worker nodes, one having
> shards with even numbers and the other one having shards with odd numbers.
> Table "A" is sharded by the column "tenant_id" into e.g. 30 shards. So the
> sharding configuration for table "A" would contain the information "worker 1
> has shards 1,3,5,..." and "worker 2 has shards 0,2,4,...". It also specifies
> that the sharding strategy should use a hash function {{shard = tenant_id %
> 30}}.
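The topology described above can be captured in a small in-memory model. This is a hedged sketch, not a proposed Calcite API; the class and method names ({{ShardingConfig}}, {{workerFor}}) are assumptions.

```java
// Hypothetical sketch of the sharding configuration: 30 shards of table A,
// odd-numbered shards on worker1, even-numbered shards on worker2, and the
// shard function shard = tenant_id % 30.
import java.util.*;

public class ShardingConfig {
    static final int SHARD_COUNT = 30;

    // worker name -> shards owned by that worker
    static final Map<String, Set<Integer>> OWNERS = new HashMap<>();
    static {
        Set<Integer> odd = new HashSet<>();
        Set<Integer> even = new HashSet<>();
        for (int s = 0; s < SHARD_COUNT; s++) {
            (s % 2 == 1 ? odd : even).add(s);
        }
        OWNERS.put("worker1", odd);
        OWNERS.put("worker2", even);
    }

    // The sharding strategy's shard function.
    static int shardOf(int tenantId) {
        return tenantId % SHARD_COUNT;
    }

    // Route a tenant_id to the worker owning its shard.
    static String workerFor(int tenantId) {
        int shard = shardOf(tenantId);
        for (Map.Entry<String, Set<Integer>> e : OWNERS.entrySet()) {
            if (e.getValue().contains(shard)) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("no owner for shard " + shard);
    }
}
```

With this configuration, {{workerFor(1)}} resolves to {{worker1}}, which is exactly the routing decision described for the example query below.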
> When an application sends a query like e.g. {{select * from A where tenant_id
> = 1}}, normally the master/coordinator would do the shard rewriting, but
> doing this already at the client can eliminate the master as a bottleneck in
> many scenarios. It is clear that the query can be completely fulfilled by
> worker 1 since it owns shard 1. The query rewriting therefore simply pushes
> the query to that worker. Note that there might be cases where a shard is
> replicated to other workers so it might be beneficial to make it configurable
> whether or when replicas should be used for querying.
> A query like {{select * from A where tenant_id in (1, 2)}} could be
> transformed to {{select * from worker1.A where tenant_id = 1 union all select
> * from worker2.A where tenant_id = 2}}. One optimization could be to target a
> single worker if it contains at least a replica of all required shards, but
> that would need to be configurable again since replicas might lag behind.
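The union-all rewrite above can be sketched as follows: group the IN-list values by owning worker and emit one branch per worker. This is a string-level illustration under the same odd/even topology, not a planner implementation; all names are assumptions.

```java
// Hypothetical sketch of the IN-list rewrite: tenant ids are grouped by the
// worker owning their shard (shard = tenant_id % 30, odd shards on worker1,
// even shards on worker2), producing one SELECT per worker joined by
// UNION ALL.
import java.util.*;
import java.util.stream.Collectors;

public class ShardRewrite {
    static String workerFor(int tenantId) {
        return (tenantId % 30) % 2 == 1 ? "worker1" : "worker2";
    }

    static String rewriteIn(List<Integer> tenantIds) {
        // Group ids by worker; TreeMap keeps the branch order deterministic.
        Map<String, List<Integer>> byWorker = new TreeMap<>();
        for (int id : tenantIds) {
            byWorker.computeIfAbsent(workerFor(id), k -> new ArrayList<>())
                    .add(id);
        }
        return byWorker.entrySet().stream()
            .map(e -> "select * from " + e.getKey()
                + ".A where tenant_id in ("
                + e.getValue().stream().map(String::valueOf)
                      .collect(Collectors.joining(", "))
                + ")")
            .collect(Collectors.joining(" union all "));
    }
}
```

For {{tenant_id in (1, 2)}} this yields one branch against {{worker1.A}} and one against {{worker2.A}}, matching the rewritten query shown above.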
>
> DML statements obviously should be handled as well but at first, I would
> simply forbid the use of multiple workers within one transaction. Supporting
> multiple workers in a transaction will essentially require a 2PC and I'm not
> sure it's always a good idea to let an application be the coordinator for
> such a transaction. There should definitely be an option to let the
> master/coordinator node of a distributed database handle the details of the
> 2PC, via a configuration that always pushes DML statements to the
> master/coordinator node.
> The sharding-jdbc implementation only offers _BASE_ guarantees. I guess there
> are cases where this makes sense, so making the transaction handling pluggable
> to allow other strategies would definitely be nice as well.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)