Hi Jark,

my intention was to avoid overly complex syntax in the first version. In past years, we could enable use cases without this clause as well, so we should be careful not to overload it with too much functionality in the first version. We can still iterate on it later; the interfaces are flexible enough to support more in the future.

I agree that explicit HASH and RANGE keywords don't hurt, and neither does making the bucket number optional.

I updated the FLIP accordingly. The SupportsBucketing interface now declares additional methods that help with validation and with providing helpful error messages to users.
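For illustration, the updated syntax could then allow variants roughly like the following (a sketch only; the exact grammar and defaults are specified in the FLIP):

```sql
-- explicit hash distribution with a fixed bucket count
CREATE TABLE T1 (uid BIGINT, name STRING)
  DISTRIBUTED BY HASH(uid) INTO 6 BUCKETS
  WITH ('connector' = 'kafka');

-- range distribution, bucket count left to the connector
CREATE TABLE T2 (uid BIGINT, name STRING)
  DISTRIBUTED BY RANGE(uid)
  WITH ('connector' = 'paimon');

-- no algorithm keyword: the connector decides (UNKNOWN kind)
CREATE TABLE T3 (uid BIGINT, name STRING)
  DISTRIBUTED BY (uid)
  WITH ('connector' = 'paimon');
```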

Let me know what you think.

Regards,
Timo


On 27.10.23 10:20, Jark Wu wrote:
Hi Timo,

Thanks for starting this discussion. I really like it!
The FLIP is already in good shape, I only have some minor comments.

1. Could we also support HASH and RANGE distribution kind on the DDL
syntax?
I noticed that HASH and UNKNOWN are introduced in the Java API, but not in
the syntax.

2. Can we make "INTO n BUCKETS" optional in CREATE TABLE and ALTER TABLE?
Some storage engines support automatically determining the bucket number
based on the cluster resources and the data size of the table, for example
StarRocks [1] and Paimon [2].

Best,
Jark

[1]:
https://docs.starrocks.io/en-us/latest/table_design/Data_distribution#determine-the-number-of-buckets
[2]:
https://paimon.apache.org/docs/0.5/concepts/primary-key-table/#dynamic-bucket

On Thu, 26 Oct 2023 at 18:26, Jingsong Li <jingsongl...@gmail.com> wrote:

Many thanks, Timo, for starting this discussion.

Big +1 for this.

The design looks good to me!

We can add some documentation for connector developers. For example: for
sinks, if a keyBy is needed, the connector should perform the keyBy
itself, since SupportsBucketing is just a marker interface.

Best,
Jingsong

On Thu, Oct 26, 2023 at 5:00 PM Timo Walther <twal...@apache.org> wrote:

Hi everyone,

I would like to start a discussion on FLIP-376: Add DISTRIBUTED BY
clause [1].

Many SQL vendors expose the concepts of Partitioning, Bucketing, and
Clustering. This FLIP continues the work of previous FLIPs and would
like to introduce the concept of "Bucketing" to Flink.

This is a pure connector characteristic and helps both the Apache Kafka and
Apache Paimon connectors avoid a complex WITH clause by providing improved
syntax.

Here is an example:

CREATE TABLE MyTable
    (
      uid BIGINT,
      name STRING
    )
    DISTRIBUTED BY (uid) INTO 6 BUCKETS
    WITH (
      'connector' = 'kafka'
    )

The full syntax specification can be found in the document. The clause
should be optional and fully backwards compatible.
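As a usage sketch (assuming the ALTER TABLE support described in the FLIP; the table name and bucket count here are hypothetical), the distribution could later be changed without recreating the table:

```sql
-- change the distribution of an existing table
ALTER TABLE MyTable
  DISTRIBUTED BY (uid) INTO 12 BUCKETS;
```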

Regards,
Timo

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause


