Re: Impala Iceberg partition DDL syntax

Tim Armstrong Wed, 02 Jun 2021 21:45:44 -0700

FWIW it looks like Spark supports multiple syntaxes for Parquet at least -
both "using" and "stored as" -
https://spark.apache.org/docs/2.3.1/sql-programming-guide.html. I don't
know if that works the same for Iceberg.


Agree that the Spark SQL syntax is reasonable and it probably makes Iceberg
adoption easier if it's aligned across different engines. I find the
function call syntax a little more intuitive too, as Aman said.

I guess a final option is to support both syntaxes, but that's
probably going to be too confusing?

- Tim

On Wed, 2 Jun 2021 at 10:48, Aman Sinha <[email protected]> wrote:

> Thanks Zoltan for bringing this up. The current Impala syntax for
> Iceberg, 'partition by spec ..', does seem a little odd, although looking at
> the JIRA it seems it was inspired by Kudu's 'partition by' syntax.
>
> I think it makes sense to align with the 'partitioned by ..' syntax that
> Spark is already supporting since these get baked into user workflows.
> Also, 'month(ts)' seems more intuitive than 'ts month' since the former
> indicates a function is being applied to the timestamp column.
>
> In terms of options, I would vote for #3 considering that the 4.0 release has
> already been in the works and IMO this particular change is not really a
> blocker. However, it is up to the release manager to decide :)
>
> Aman
>
>
> On Wed, Jun 2, 2021 at 8:31 AM Zoltán Borók-Nagy <[email protected]>
> wrote:
>
> > Hi Team,
> >
> > We've been working on Iceberg support in Impala for quite some time.
> > The status is quite good, Impala master is able to read/write/alter
> > Iceberg tables (there's still some work to make it production-ready).
> >
> > The problem is that currently we have a DDL syntax for defining Iceberg
> > partitions that differs from SparkSQL:
> > https://iceberg.apache.org/spark-ddl/#partitioned-by
> >
> > E.g. Impala is using the following syntax:
> >
> > CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
> > PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
> > STORED AS ICEBERG;
> >
> > The same in Spark is:
> >
> > CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
> > USING ICEBERG
> > PARTITIONED BY (bucket(5, i), months(ts), years(d))
> >
> >
> > Impala's syntax is older but hasn't been released yet. Spark's syntax is
> > released so it cannot be changed.
> >
> > Hive is also working on DDL support for Iceberg partitions, and they are
> > favoring the released SparkSQL syntax.
> >
> > I think it would be favorable if Impala used the same syntax that the
> > other engines use.
> > The DDLs won't match exactly as Spark has USING while Impala has STORED
> > AS, and Hive will have STORED BY ICEBERG. But I think it would be nice if we
> > could converge as much as we can.
> >
> > Given that we want to have a 4.0 release soon I think we have the
> > following options:
> >
> >    1. Keep the current syntax, remaining inconsistent with other engines
> >    2. Hold the 4.0 release and fix the syntax
> >    3. Mark Iceberg support as experimental and change the syntax for 4.1
> >
> > What do you think? If you agree to change the syntax, I will volunteer to
> > implement it.
> >
> > Cheers,
> >     Zoltan
> >
>
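
As a rough illustration of how mechanical the mapping between the two
partition clauses in this thread is, here is a minimal Python sketch that
rewrites the body of an Impala-style PARTITION BY SPEC (...) clause into the
Spark-style PARTITIONED BY (...) form. It only covers the three transforms
that appear in the example above (BUCKET, MONTH, YEAR). The function name
impala_spec_to_spark is hypothetical, and this is not how either engine
parses DDL; it is only meant to show the correspondence.

```python
# Time-based transforms taken from the thread's example. Anything not listed
# here is rejected rather than guessed at.
_TIME_TRANSFORMS = {"MONTH": "months", "YEAR": "years"}


def impala_spec_to_spark(spec: str) -> str:
    """Rewrite the body of PARTITION BY SPEC (...) as a PARTITIONED BY (...) body.

    Hypothetical helper for illustration only; not part of Impala or Spark.
    """
    fields = []
    for item in spec.split(","):
        parts = item.split()
        if len(parts) == 2 and parts[1].upper() in _TIME_TRANSFORMS:
            # e.g. "ts MONTH" -> "months(ts)"
            col, transform = parts
            fields.append(f"{_TIME_TRANSFORMS[transform.upper()]}({col})")
        elif len(parts) == 3 and parts[1].upper() == "BUCKET" and parts[2].isdigit():
            # e.g. "i BUCKET 5" -> "bucket(5, i)"
            col, _, num_buckets = parts
            fields.append(f"bucket({num_buckets}, {col})")
        else:
            raise ValueError(f"unsupported partition field: {item.strip()!r}")
    return ", ".join(fields)


print(impala_spec_to_spark("i BUCKET 5, ts MONTH, d YEAR"))
# prints: bucket(5, i), months(ts), years(d)
```

Unsupported fields raise a ValueError instead of being passed through, since
any transform not shown in the thread would need its own mapping to be
checked against each engine's documentation.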
