Hi Team,

We've been working on Iceberg support in Impala for quite some time.
The status is quite good: Impala master can read, write, and alter Iceberg
tables, although there's still some work left to make it production-ready.

The problem is that currently we have a DDL syntax for defining Iceberg
partitions that differs from SparkSQL:
https://iceberg.apache.org/spark-ddl/#partitioned-by

For example, Impala currently uses the following syntax:

    CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
    PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
    STORED AS ICEBERG;

The same in Spark is:

    CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
    USING ICEBERG
    PARTITIONED BY (bucket(5, i), months(ts), years(d))
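
To make the difference concrete, here is a rough Python sketch (not Impala
code) of how the two partition clauses map onto each other. The function
name impala_spec_to_spark and the lookup tables are hypothetical, and only
the three transforms used in the examples above (BUCKET, MONTH, YEAR) are
covered; anything else is rejected rather than guessed at.

```python
# Hypothetical illustration only: maps the Impala-style partition spec shown
# above to the Spark-style transform list. Covers only BUCKET, MONTH and YEAR.
_NO_ARG = {"YEAR": "years", "MONTH": "months"}
_WITH_ARG = {"BUCKET": "bucket"}


def impala_spec_to_spark(spec: str) -> str:
    """Translate e.g. 'i BUCKET 5, ts MONTH' into 'bucket(5, i), months(ts)'."""
    parts = []
    for field in spec.split(","):
        tokens = field.split()
        if len(tokens) == 2 and tokens[1].upper() in _NO_ARG:
            # "<col> MONTH" becomes "months(<col>)"
            col, transform = tokens
            parts.append(f"{_NO_ARG[transform.upper()]}({col})")
        elif (len(tokens) == 3 and tokens[1].upper() in _WITH_ARG
              and tokens[2].isdigit()):
            # "<col> BUCKET <n>" becomes "bucket(<n>, <col>)"
            col, transform, arg = tokens
            parts.append(f"{_WITH_ARG[transform.upper()]}({arg}, {col})")
        else:
            raise ValueError(f"unsupported partition field: {field.strip()!r}")
    return ", ".join(parts)


print(impala_spec_to_spark("i BUCKET 5, ts MONTH, d YEAR"))
```

The point is only that the mapping is mechanical: the column and the
transform swap places, and the transform parameter moves in front of the
column.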


Impala's syntax is older but hasn't been released yet. Spark's syntax is
released so it cannot be changed.

Hive is also working on DDL support for Iceberg partitions, and they are
favoring the released SparkSQL syntax.

I think it would be preferable for Impala to use the same syntax as the
other engines.
The DDLs won't match exactly, since Spark has USING, Impala has STORED AS,
and Hive will have STORED BY ICEBERG. But I think it would be nice if we
could converge as much as we can.

Given that we want to have a 4.0 release soon, I think we have the following
options:

   1. Keep the current syntax and remain inconsistent with other engines
   2. Hold the 4.0 release and fix the syntax first
   3. Mark Iceberg support as experimental and change the syntax for 4.1

What do you think? If you agree to change the syntax, I will volunteer to
implement it.

Cheers,
    Zoltan
