Re: Impala Iceberg partition DDL syntax

Amogh Margoor Wed, 02 Jun 2021 23:59:48 -0700

I agree looking at the existing support for Iceberg (Spark SQL, Trino and
Flink) and also what Impala supports for non-iceberg tables it makes more
sense to go with:


*PARTITIONED BY (bucket(5, i), months(ts), years(d))*

- PARTITIONED BY is supported by Spark SQL, Flink and for HDFS tables in
Impala.
- Function syntax for Partition transform for Iceberg's hidden partition is
already supported by Spark SQL and Trino (Flink doesn't support hidden
partitions yet).

IMO, it's better to support just one syntax instead of both. For instance,
if in future more complex Partition transforms are introduced which take
more than 2 arguments ('f(x,y,z)') then supporting it in PARTITION BY SPEC
syntax can start getting confusing.

Regards,
Amogh





On Thu, Jun 3, 2021 at 5:45 AM Tim Armstrong <[email protected]>
wrote:

> FWIW it looks like Spark supports multiple syntaxes for Parquet at least -
> both "using" and "stored as" -
> https://spark.apache.org/docs/2.3.1/sql-programming-guide.html. I don't
> know if that works the same for iceberg.
>
> Agree that the Spark SQL syntax is reasonable and it probably makes Iceberg
> adoption easier if it's aligned across different engines. I find the
> function call syntax a little more intuitive too, as Aman said.
>
> I guess a final option is to support both syntaxes, but I guess that's
> probably going to be too confusing?
>
> - Tim
>
> On Wed, 2 Jun 2021 at 10:48, Aman Sinha <[email protected]> wrote:
>
> > Thanks Zoltan for bringing this up.  The current Impala syntax for
> > Iceberg:  'partition by spec ..' does seem a little odd although looking
> at
> > the jira it seems it was inspired by Kudu's partition by syntax.
> >
> > I think it makes sense to align with the 'partitioned by ..' syntax that
> > Spark is already supporting since these get baked into user workflows.
> > Also, 'month(ts)' seems more intuitive than 'ts month' since the former
> > indicates a function is being applied to the timestamp column.
> >
> > In terms of options, I would vote for #3 considering that 4.0 release has
> > already been in the works and IMO this particular change is not really a
> > blocker.  However, it is up to the release manager  to decide :)
> >
> > Aman
> >
> >
> > On Wed, Jun 2, 2021 at 8:31 AM Zoltán Borók-Nagy <[email protected]>
> > wrote:
> >
> > > Hi Team,
> > >
> > > We've been working on Iceberg support in Impala for quite some time.
> > > The status is quite good, Impala master is able to read/write/alter
> > Iceberg
> > > tables (there's still some work to make it production-ready).
> > >
> > > The problem is that currently we have a DDL syntax for defining Iceberg
> > > partitions that differs from SparkSQL:
> > > https://iceberg.apache.org/spark-ddl/#partitioned-by
> > >
> > > E.g. Impala is using the following syntax:
> > >
> > > CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
> > >
> > > *PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)*
> > >
> > > STORED AS ICEBERG;
> > >
> > > The same in Spark is:
> > >
> > > CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
> > >
> > > USING ICEBERG
> > >
> > > *PARTITIONED BY (bucket(5, i), months(ts), years(d))*
> > >
> > >
> > > Impala's syntax is older but hasn't been released yet. Spark's syntax
> is
> > > released so it cannot be changed.
> > >
> > > Hive is also working on DDL support for Iceberg partitions, and they
> are
> > > favoring the released SparkSQL syntax.
> > >
> > > I think it would be favorable if Impala used the same syntax that the
> > other
> > > engines use.
> > > The DDLs won't match exactly as Spark has USING while Impala has STORED
> > AS,
> > > and Hive will have STORED BY ICEBERG. But I think it would be nice if
> we
> > > could converge as much as we can.
> > >
> > > Given that we want to have a 4.0 release soon I think we have the
> > following
> > > options:
> > >
> > >    1. Keep the current syntax, being inconsistent with other engines
> > >    2. Hold on the 4.0 release and fix the syntax
> > >    3. Marking Iceberg support as experimental and change syntax for 4.1
> > >
> > > What do you think? If you agree to change the syntax, I will volunteer
> to
> > > implement it.
> > >
> > > Cheers,
> > >     Zoltan
> > >
> >
>

Re: Impala Iceberg partition DDL syntax

Reply via email to