[
https://issues.apache.org/jira/browse/CALCITE-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030038#comment-17030038
]
Julian Hyde commented on CALCITE-2141:
--------------------------------------
Let’s frame it as a MV rewrite problem. Suppose you have a table T with a shard
column and you know that shard can be 0, 1 or 2. And you have an MV T0 = select
* from T where shard = 0; T1 and T2 similarly.
Can you answer queries such as select count(*) from T and select a from T where
shard < 2 in terms of just the MVs?
That maps into a fairly straightforward task in MV rewrite.
> Implement query rewrite based on sharding configuration
> -------------------------------------------------------
>
> Key: CALCITE-2141
> URL: https://issues.apache.org/jira/browse/CALCITE-2141
> Project: Calcite
> Issue Type: New Feature
> Reporter: Christian Beikov
> Priority: Minor
>
> Based on topology changes, it should be possible to dynamically update a
> sharding configuration for calcite. The effect of such a configuration is,
> that a query involving sharded tables is rewritten to a more optimal form
> possibly targetting mutliple different datasources.
> This is an interesting building block for distributed databases but also for
> applications since it enables the implementation of a static sharding scheme.
> Doing the shard rewriting on a client is also a lot better when using a
> distributed database as that eliminates the need for a coordinator node
> through which data is tunneled.
> Also see [https://github.com/shardingjdbc/sharding-jdbc] for an existing
> implementation.
> So imagine a topology with a master node and 2 worker nodes, one having
> shards with even numbers and the other one having shards with odd numbers.
> Table "A" is sharded by the column "tenant_id" into e.g. 30 shards. So the
> sharding configuration for table "A" would contain the information "worker 1
> has shards 1,3,5,..." and "worker 2 has shards 0,2,4,...". It also specifies
> that the sharding strategy should use a hash function {{shard = tenant_id %
> 30}}.
> When an application sends a query like e.g. {{select * from A where tenant_id
> = 1}}, normally the master/coordinator would do the shard rewriting, but
> doing this already at the client can eliminate the master as bottleneck for
> many scenarios. It is clear that the query can be completely fulfilled by
> worker 1 since it owns shard 1. The query rewriting therefore simply pushes
> the query to that worker. Note that there might be cases where a shard is
> replicated to other workers so it might be beneficial to make it configurable
> whether or when replicas should be used for querying.
> A query like \{{select * from A where tenant_id in(1,2)}} could be
> transformed to {{select * from worker1.A where tenant_id = 1 union all select
> * from worker2.A where tenant_id = 2}}. One optimization could be to target a
> single worker if it contains at least a replica of all required shards, but
> that would need to be configurable again since replicas might lag behind.
>
> DML statements obviously should be handled as well but at first, I would
> simply forbid the use of multiple workers within one transaction. Supporting
> multiple workers in a transaction will essentially require a 2PC and I'm not
> sure it's always a good idea to let an application be the coordinator for
> such a transaction. There should definitely be an option to let a
> master/coordinator node of a distributed database handle the details of the
> 2PC by configuring that DML statements should always be pushed to the
> master/coordinator node.
> The sharding-jdbc implementation only offers _BASE_ guarantees. I guess there
> are cases where this makes sense so making the transaction handling pluggable
> to allow other strategies would definitely be nice as well.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)