Thank you Ryan for proposing the DDL syntax! I think it's good to follow mainstream databases, and the proposed syntax looks very reasonable.
About Hive compatibility, I think it's not that important now, but it's
still good if we keep it. Shall we support the Hive syntax as an
alternative? It doesn't seem very hard, just a few more ANTLR rules. It
would be even better if we could make ANTLR extensible and allow other
data sources to define custom SQL syntax. Anyway, I think these are
orthogonal: we can go ahead with the proposed syntax here and add
Hive-compatible syntax later.

On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I'd say that it was important to be compatible with Hive in the past, but
> that's becoming less important over time. Spark is well established with
> Hadoop users, and I think the focus moving forward should be to make Spark
> more predictable as a SQL engine for people coming from more traditional
> databases.
>
> That said, I think there is no problem supporting the alter syntax for
> both Hive/MySQL and the more standard versions.
>
> On Tue, Oct 2, 2018 at 8:35 AM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> I think it has been an important “selling point” that Spark is “mostly
>> compatible“ with Hive DDL.
>>
>> I have seen a lot of teams suffering from switching between Presto and
>> Hive dialects.
>>
>> So one question I have is: are we at the point of switching from
>> Hive-compatible to ANSI SQL?
>>
>> Perhaps a more critical question: what does it take to get the platform
>> to support both, by making the ANTLR grammar extensible?
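[Editor's note: as an illustration of the "few more ANTLR rules" idea, a sketch in the style of SqlBase.g4 that would accept both the standard and the Hive/MySQL keywords as alternatives. The rule and label names here are hypothetical, not taken from the actual grammar:]

```antlr
// Illustrative only: accept several dialects' keywords for the same statement.
statement
    : ALTER TABLE tableIdentifier
        (ALTER | MODIFY | CHANGE) COLUMN? qualifiedName TYPE? dataType   #alterColumnType
    | ALTER TABLE tableIdentifier
        RENAME COLUMN qualifiedName TO qualifiedName                     #renameColumn
    | ALTER TABLE tableIdentifier
        DROP (COLUMN | COLUMNS) qualifiedNameList                        #dropColumns
    ;
```

Making COLUMN and TYPE optional tokens is what lets one rule cover the PostgreSQL, Oracle, and MySQL spellings at once.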
>>
>> ------------------------------
>> *From:* Alessandro Solimando <alessandro.solima...@gmail.com>
>> *Sent:* Tuesday, October 2, 2018 12:35 AM
>> *To:* rb...@netflix.com
>> *Cc:* Xiao Li; dev
>> *Subject:* Re: [DISCUSS] Syntax for table DDL
>>
>> I agree with Ryan: a "standard" and more widely adopted syntax is usually
>> a good idea, with possibly some slight improvements like "bulk deletion"
>> of columns (especially because both the syntax and the semantics are
>> clear), rather than staying with Hive syntax at any cost.
>>
>> I am personally following this PR with a lot of interest; thanks for all
>> the work in this direction.
>>
>> Best regards,
>> Alessandro
>>
>> On Mon, 1 Oct 2018 at 20:21, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> What do you mean by consistent with the syntax in SqlBase.g4? These
>>> aren't currently defined, so we need to decide what syntax to support.
>>> There are more details below, but the syntax I'm proposing is more
>>> standard across databases than Hive, which uses confusing and
>>> non-standard syntax.
>>>
>>> I doubt that we want to support Hive syntax, for a few reasons. Hive
>>> uses the same column CHANGE statement for multiple purposes, so it ends
>>> up with strange patterns for simple tasks, like updating a column's type:
>>>
>>>     ALTER TABLE t CHANGE a1 a1 INT;
>>>
>>> The column name is doubled because the old name, new name, and type are
>>> always required. So you have to know the type of a column to change its
>>> name, and you have to repeat the name to change its type. Hive also
>>> allows a couple of other oddities:
>>>
>>> - Column reordering with the FIRST and AFTER keywords. Column reordering
>>>   is tricky to get right, so I'm not sure we want to add it.
>>> - RESTRICT and CASCADE to signal whether to change all partitions or
>>>   not. Spark doesn't support partition-level schemas except through
>>>   Hive, and even then I'm not sure how reliable it is.
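[Editor's note: for readers unfamiliar with the Hive forms being discussed, the oddities above look roughly like the following. These are illustrative Hive DDL statements with made-up table/column names, per the Hive LanguageManual:]

```sql
-- Reorder: move a1 after b1 (old name, new name, AND type must all be given)
ALTER TABLE t CHANGE COLUMN a1 a1 INT AFTER b1;

-- Move a column to the front of the schema
ALTER TABLE t CHANGE COLUMN a1 a1 INT FIRST;

-- CASCADE applies the change to the table metadata and all partition metadata;
-- RESTRICT (the default) changes only the table metadata
ALTER TABLE t CHANGE COLUMN a1 a1 BIGINT CASCADE;
```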
>>>
>>> I know that we wouldn't necessarily have to support these features from
>>> Hive, but I'm pointing them out to ask the question: why copy Hive's
>>> syntax if it is unlikely that Spark will implement all of the
>>> "features"? I'd rather go with SQL syntax from databases like
>>> PostgreSQL or others that are more standard and common.
>>>
>>> The more "standard" versions of these statements are like what I've
>>> proposed:
>>>
>>> - ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is
>>>   used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and
>>>   Oracle. COLUMN is optional in Oracle, and TYPE is omitted by databases
>>>   other than PostgreSQL. I think we could easily add MODIFY as an
>>>   alternative to the second ALTER (and maybe alternatives like UPDATE
>>>   and CHANGE) and make both TYPE and COLUMN optional.
>>> - ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This
>>>   syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the
>>>   same syntax as Hive, and it appears that SQL Server doesn't have this
>>>   statement. This also matches the table rename syntax, which uses TO.
>>> - ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
>>>   matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN
>>>   optional. Most don't allow deleting multiple columns, but it's a
>>>   reasonable extension.
>>>
>>> While we're on the subject of ALTER TABLE DDL, I should note that all of
>>> the databases use ADD COLUMN syntax that differs from Hive (and,
>>> currently, Spark):
>>>
>>> - ALTER TABLE ident ADD COLUMN qualifiedName dataType (','
>>>   qualifiedName dataType)*: All other databases I looked at use ADD
>>>   COLUMN, but not all of them support adding multiple columns at the
>>>   same time. Hive requires ( and ) enclosing the columns and uses the
>>>   COLUMNS keyword instead of COLUMN.
>>>   I think that Spark should be updated to make the parens optional and
>>>   to support both keywords, COLUMN and COLUMNS.
>>>
>>> What does everyone think? Is it reasonable to use the more standard
>>> syntax instead of using Hive as a base?
>>>
>>> rb
>>>
>>> On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> Are they consistent with the current syntax defined in SqlBase.g4? I
>>>> think we are following the Hive DDL syntax:
>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>>>>
>>>> On Fri, Sep 28, 2018 at 3:47 PM, Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm currently working on new table DDL statements for v2 tables. For
>>>>> context, the new logical plans for DataSourceV2 require a catalog
>>>>> interface so that Spark can create tables for operations like CTAS.
>>>>> The proposed TableCatalog API also includes an API for altering those
>>>>> tables so we can make ALTER TABLE statements work. I'm implementing
>>>>> those DDL statements, which will make it into upstream Spark when the
>>>>> TableCatalog PR is merged.
>>>>>
>>>>> Since I'm adding new SQL statements that don't yet exist in Spark, I
>>>>> want to make sure that the syntax I'm using in our branch will match
>>>>> the syntax we add to Spark later. I'm basing this proposed syntax on
>>>>> PostgreSQL
>>>>> <https://www.postgresql.org/docs/current/static/ddl-alter.html>.
>>>>>
>>>>> - *Update data type*: ALTER TABLE tableIdentifier ALTER COLUMN
>>>>>   qualifiedName TYPE dataType
>>>>> - *Rename column*: ALTER TABLE tableIdentifier RENAME COLUMN
>>>>>   qualifiedName TO qualifiedName
>>>>> - *Drop column*: ALTER TABLE tableIdentifier DROP (COLUMN | COLUMNS)
>>>>>   qualifiedNameList
>>>>>
>>>>> A few notes:
>>>>>
>>>>> - Using qualifiedName in these rules allows updating nested types,
>>>>>   like point.x.
>>>>> - Updates and renames can only alter one column, but drop can drop a
>>>>>   list.
>>>>> - Rename can't move types and will validate that, if the TO name is
>>>>>   qualified, the prefix matches the original field.
>>>>> - I'm also changing ADD COLUMN to support adding fields to nested
>>>>>   columns by using qualifiedName instead of identifier.
>>>>>
>>>>> Please reply to this thread if you have suggestions based on a
>>>>> different SQL engine or want this syntax to be different for another
>>>>> reason. Thanks!
>>>>>
>>>>> rb
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
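[Editor's note: putting the whole proposal together, the statements described in this thread would look like the following. This is an illustrative sketch instantiating the proposed grammar; the table and column names are made up:]

```sql
-- Update a column's data type (qualifiedName allows nested fields)
ALTER TABLE db.events ALTER COLUMN point.x TYPE double;

-- Rename a column; if the TO name is qualified, its prefix must match
-- the original field (no moving a field between structs)
ALTER TABLE db.events RENAME COLUMN point.x TO point.horizontal;

-- Drop one or more columns in a single statement
ALTER TABLE db.events DROP COLUMNS point.x, created_by;

-- Add a field to a nested column (qualifiedName instead of identifier)
ALTER TABLE db.events ADD COLUMN point.z double;
```

Note how the one-column-per-statement rule applies to ALTER and RENAME, while DROP accepts a list, exactly as the notes above describe.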