FWIW it looks like Spark supports multiple syntaxes for Parquet at least - both "using" and "stored as" - https://spark.apache.org/docs/2.3.1/sql-programming-guide.html. I don't know if that works the same for iceberg.
Agree that the Spark SQL syntax is reasonable and it probably makes Iceberg adoption easier if it's aligned across different engines. I find the function call syntax a little more intuitive too, as Aman said. I guess a final option is to support both syntaxes, but I guess that's probably going to be too confusing? - Tim On Wed, 2 Jun 2021 at 10:48, Aman Sinha <amansi...@apache.org> wrote: > Thanks Zoltan for bringing this up. The current Impala syntax for > Iceberg: 'partition by spec ..' does seem a little odd although looking at > the jira it seems it was inspired by Kudu's partition by syntax. > > I think it makes sense to align with the 'partitioned by ..' syntax that > Spark is already supporting since these get baked into user workflows. > Also, 'month(ts)' seems more intuitive than 'ts month' since the former > indicates a function is being applied to the timestamp column. > > In terms of options, I would vote for #3 considering that 4.0 release has > already been in the works and IMO this particular change is not really a > blocker. However, it is up to the release manager to decide :) > > Aman > > > On Wed, Jun 2, 2021 at 8:31 AM Zoltán Borók-Nagy <borokna...@apache.org> > wrote: > > > Hi Team, > > > > We've been working on Iceberg support in Impala for quite some time. > > The status is quite good, Impala master is able to read/write/alter > Iceberg > > tables (there's still some work to make it production-ready). > > > > The problem is that currently we have a DDL syntax for defining Iceberg > > partitions that differs from SparkSQL: > > https://iceberg.apache.org/spark-ddl/#partitioned-by > > > > E.g. Impala is using the following syntax: > > > > CREATE TABLE ice_t (i int, s string, ts timestamp, d date) > > > > *PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)* > > > > STORED AS ICEBERG; > > > > The same in Spark is: > > > > CREATE TABLE ice_t (i int, s string, ts timestamp, d date) > > > > USING ICEBERG > > > > *PARTITIONED BY (bucket(5, i), months(ts), years(d))* > > > > > > Impala's syntax is older but hasn't been released yet. Spark's syntax is > > released so it cannot be changed. > > > > Hive is also working on DDL support for Iceberg partitions, and they are > > favoring the released SparkSQL syntax. > > > > I think it would be favorable if Impala used the same syntax that the > other > > engines use. > > The DDLs won't match exactly as Spark has USING while Impala has STORED > AS, > > and Hive will have STORED BY ICEBERG. But I think it would be nice if we > > could converge as much as we can. > > > > Given that we want to have a 4.0 release soon I think we have the > following > > options: > > > > 1. Keep the current syntax, being inconsistent with other engines > > 2. Hold on the 4.0 release and fix the syntax > > 3. Marking Iceberg support as experimental and change syntax for 4.1 > > > > What do you think? If you agree to change the syntax, I will volunteer to > > implement it. > > > > Cheers, > > Zoltan > > >