Re: Impala Iceberg partition DDL syntax

Zoltán Borók-Nagy Fri, 04 Jun 2021 02:23:20 -0700

Thank you all for your response.

Yeah, I didn't think about supporting both syntaxes.
But since the feature is still in an early, unreleased stage, I think it's
probably better to stick to one syntax. It's less confusing, no need to
maintain two code bases for the same thing, etc.


So I think the conclusion here is option 3: "Marking Iceberg support as
experimental and change syntax for 4.1".
Unless someone has strong objections about it.

If not, I'll start working on the required code changes next week.

Cheers,
    Zoltan


On Thu, Jun 3, 2021 at 8:59 AM Amogh Margoor <amarg...@cloudera.com> wrote:

> I agree looking at the existing support for Iceberg (Spark SQL, Trino and
> Flink) and also what Impala supports for non-iceberg tables it makes more
> sense to go with:
>
> *PARTITIONED BY (bucket(5, i), months(ts), years(d))*
>
> - PARTITIONED BY is supported by Spark SQL, Flink and for HDFS tables in
> Impala.
> - Function syntax for Partition transform for Iceberg's hidden partition is
> already supported by Spark SQL and Trino (Flink doesn't support hidden
> partitions yet).
>
> IMO, it's better to support just one syntax instead of both. For instance,
> if in future more complex Partition transforms are introduced which take
> more than 2 arguments ('f(x,y,z)') then supporting it in PARTITION BY SPEC
> syntax can start getting confusing.
>
> Regards,
> Amogh
>
>
>
>
>
> On Thu, Jun 3, 2021 at 5:45 AM Tim Armstrong <tim.g.armstr...@gmail.com>
> wrote:
>
> > FWIW it looks like Spark supports multiple syntaxes for Parquet at least
> -
> > both "using" and "stored as" -
> > https://spark.apache.org/docs/2.3.1/sql-programming-guide.html. I don't
> > know if that works the same for iceberg.
> >
> > Agree that the Spark SQL syntax is reasonable and it probably makes
> Iceberg
> > adoption easier if it's aligned across different engines. I find the
> > function call syntax a little more intuitive too, as Aman said.
> >
> > I guess a final option is to support both syntaxes, but I guess that's
> > probably going to be too confusing?
> >
> > - Tim
> >
> > On Wed, 2 Jun 2021 at 10:48, Aman Sinha <amansi...@apache.org> wrote:
> >
> > > Thanks Zoltan for bringing this up.  The current Impala syntax for
> > > Iceberg:  'partition by spec ..' does seem a little odd although
> looking
> > at
> > > the jira it seems it was inspired by Kudu's partition by syntax.
> > >
> > > I think it makes sense to align with the 'partitioned by ..' syntax
> that
> > > Spark is already supporting since these get baked into user workflows.
> > > Also, 'month(ts)' seems more intuitive than 'ts month' since the former
> > > indicates a function is being applied to the timestamp column.
> > >
> > > In terms of options, I would vote for #3 considering that 4.0 release
> has
> > > already been in the works and IMO this particular change is not really
> a
> > > blocker.  However, it is up to the release manager  to decide :)
> > >
> > > Aman
> > >
> > >
> > > On Wed, Jun 2, 2021 at 8:31 AM Zoltán Borók-Nagy <
> borokna...@apache.org>
> > > wrote:
> > >
> > > > Hi Team,
> > > >
> > > > We've been working on Iceberg support in Impala for quite some time.
> > > > The status is quite good, Impala master is able to read/write/alter
> > > Iceberg
> > > > tables (there's still some work to make it production-ready).
> > > >
> > > > The problem is that currently we have a DDL syntax for defining
> Iceberg
> > > > partitions that differs from SparkSQL:
> > > > https://iceberg.apache.org/spark-ddl/#partitioned-by
> > > >
> > > > E.g. Impala is using the following syntax:
> > > >
> > > > CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
> > > >
> > > > *PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)*
> > > >
> > > > STORED AS ICEBERG;
> > > >
> > > > The same in Spark is:
> > > >
> > > > CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
> > > >
> > > > USING ICEBERG
> > > >
> > > > *PARTITIONED BY (bucket(5, i), months(ts), years(d))*
> > > >
> > > >
> > > > Impala's syntax is older but hasn't been released yet. Spark's syntax
> > is
> > > > released so it cannot be changed.
> > > >
> > > > Hive is also working on DDL support for Iceberg partitions, and they
> > are
> > > > favoring the released SparkSQL syntax.
> > > >
> > > > I think it would be favorable if Impala used the same syntax that the
> > > other
> > > > engines use.
> > > > The DDLs won't match exactly as Spark has USING while Impala has
> STORED
> > > AS,
> > > > and Hive will have STORED BY ICEBERG. But I think it would be nice if
> > we
> > > > could converge as much as we can.
> > > >
> > > > Given that we want to have a 4.0 release soon I think we have the
> > > following
> > > > options:
> > > >
> > > >    1. Keep the current syntax, being inconsistent with other engines
> > > >    2. Hold on the 4.0 release and fix the syntax
> > > >    3. Marking Iceberg support as experimental and change syntax for
> 4.1
> > > >
> > > > What do you think? If you agree to change the syntax, I will
> volunteer
> > to
> > > > implement it.
> > > >
> > > > Cheers,
> > > >     Zoltan
> > > >
> > >
> >
>

Re: Impala Iceberg partition DDL syntax

Reply via email to