I'd say that it was important to be compatible with Hive in the past, but that's becoming less important over time. Spark is well established with Hadoop users, and I think the focus moving forward should be to make Spark more predictable as a SQL engine for people coming from more traditional databases.
That said, I think there is no problem supporting the ALTER syntax for both Hive/MySQL and the more standard versions.

On Tue, Oct 2, 2018 at 8:35 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

> I think it has been an important “selling point” that Spark is “mostly
> compatible” with Hive DDL.
>
> I have seen a lot of teams suffering from switching between Presto and
> Hive dialects.
>
> So one question I have is: are we at the point of switching from
> Hive-compatible to ANSI SQL, say?
>
> Perhaps a more critical question: what would it take for the platform
> to support both, by making the ANTLR grammar extensible?
>
> ------------------------------
> *From:* Alessandro Solimando <alessandro.solima...@gmail.com>
> *Sent:* Tuesday, October 2, 2018 12:35 AM
> *To:* rb...@netflix.com
> *Cc:* Xiao Li; dev
> *Subject:* Re: [DISCUSS] Syntax for table DDL
>
> I agree with Ryan: a "standard" and more widely adopted syntax is
> usually a good idea, with possibly some slight improvements like "bulk
> deletion" of columns (especially because both the syntax and the
> semantics are clear), rather than staying with Hive syntax at any cost.
>
> I am personally following this PR with a lot of interest; thanks for
> all the work in this direction.
>
> Best regards,
> Alessandro
>
> On Mon, 1 Oct 2018 at 20:21, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> What do you mean by consistent with the syntax in SqlBase.g4? These
>> statements aren’t currently defined, so we need to decide what syntax
>> to support. There are more details below, but the syntax I’m proposing
>> is more standard across databases than Hive, which uses confusing and
>> non-standard syntax.
>>
>> I doubt that we want to support Hive syntax, for a few reasons.
>> Hive uses the same column CHANGE statement for multiple purposes, so
>> it ends up with strange patterns for simple tasks, like updating a
>> column’s type:
>>
>> ALTER TABLE t CHANGE a1 a1 INT;
>>
>> The column name is doubled because the old name, new name, and type
>> are always required. So you have to know the type of a column to
>> change its name, and you have to double up the name to change its
>> type. Hive also allows a couple of other oddities:
>>
>> - Column reordering with the FIRST and AFTER keywords. Column
>> reordering is tricky to get right, so I’m not sure we want to add it.
>> - RESTRICT and CASCADE to signal whether or not to change all
>> partitions. Spark doesn’t support partition-level schemas except
>> through Hive, and even then I’m not sure how reliable it is.
>>
>> I know that we wouldn’t necessarily have to support these features
>> from Hive, but I’m pointing them out to ask the question: why copy
>> Hive’s syntax if it is unlikely that Spark will implement all of the
>> “features”? I’d rather go with SQL syntax from databases like
>> PostgreSQL or others that are more standard and common.
>>
>> The more “standard” versions of these statements are like what I’ve
>> proposed:
>>
>> - ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER
>> is used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL
>> and Oracle. COLUMN is optional in Oracle, and TYPE is omitted by
>> databases other than PostgreSQL. I think we could easily add MODIFY
>> as an alternative to the second ALTER (and maybe alternatives like
>> UPDATE and CHANGE) and make both TYPE and COLUMN optional.
>> - ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName:
>> This syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses
>> the same syntax as Hive, and it appears that SQL Server doesn’t have
>> this statement. This also matches the table rename syntax, which
>> uses TO.
>> - ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
>> matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN
>> optional. Most don’t allow deleting multiple columns, but it’s a
>> reasonable extension.
>>
>> While we’re on the subject of ALTER TABLE DDL, I should note that all
>> of the databases use ADD COLUMN syntax that differs from Hive (and,
>> currently, Spark):
>>
>> - ALTER TABLE ident ADD COLUMN qualifiedName dataType (','
>> qualifiedName dataType)*: All the other databases I looked at use
>> ADD COLUMN, but not all of them support adding multiple columns at
>> the same time. Hive requires ( and ) enclosing the columns and uses
>> the COLUMNS keyword instead of COLUMN. I think Spark should be
>> updated to make the parens optional and to support both keywords,
>> COLUMN and COLUMNS.
>>
>> What does everyone think? Is it reasonable to use the more standard
>> syntax instead of using Hive as a base?
>>
>> rb
>>
>> On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> Are they consistent with the current syntax defined in SqlBase.g4? I
>>> think we are following the Hive DDL syntax:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>>>
>>> On Fri, Sep 28, 2018 at 3:47 PM, Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I’m currently working on new table DDL statements for v2 tables.
>>>> For context, the new logical plans for DataSourceV2 require a
>>>> catalog interface so that Spark can create tables for operations
>>>> like CTAS. The proposed TableCatalog API also includes an API for
>>>> altering those tables so we can make ALTER TABLE statements work.
>>>> I’m implementing those DDL statements, which will make it into
>>>> upstream Spark when the TableCatalog PR is merged.
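To make the Hive-vs-standard contrast above concrete, here is a side-by-side sketch of the two dialects; the table `t` and its columns are hypothetical, and the statements are not meant to run against any single engine:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE t (id INT, a1 STRING);

-- Hive: one CHANGE statement covers both rename and retype, so the
-- name is doubled to change only the type, and the type must be
-- restated to change only the name.
ALTER TABLE t CHANGE a1 a1 INT;       -- change the type of a1
ALTER TABLE t CHANGE a1 b1 STRING;    -- rename a1 to b1

-- The more standard forms discussed in this thread:
ALTER TABLE t ALTER COLUMN a1 TYPE INT;   -- retype (PostgreSQL-style)
ALTER TABLE t RENAME COLUMN a1 TO b1;     -- rename (PostgreSQL/Oracle/DB2)
ALTER TABLE t DROP COLUMN b1;             -- drop (widely supported)
ALTER TABLE t ADD COLUMN c1 DOUBLE;       -- add (nearly universal)
```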
>>>>
>>>> Since I’m adding new SQL statements that don’t yet exist in Spark,
>>>> I want to make sure that the syntax I’m using in our branch will
>>>> match the syntax we add to Spark later. I’m basing this proposed
>>>> syntax on PostgreSQL
>>>> <https://www.postgresql.org/docs/current/static/ddl-alter.html>:
>>>>
>>>> - *Update data type*: ALTER TABLE tableIdentifier ALTER COLUMN
>>>> qualifiedName TYPE dataType
>>>> - *Rename column*: ALTER TABLE tableIdentifier RENAME COLUMN
>>>> qualifiedName TO qualifiedName
>>>> - *Drop column*: ALTER TABLE tableIdentifier DROP (COLUMN |
>>>> COLUMNS) qualifiedNameList
>>>>
>>>> A few notes:
>>>>
>>>> - Using qualifiedName in these rules allows updating nested types,
>>>> like point.x.
>>>> - Updates and renames can only alter one column, but drop can drop
>>>> a list.
>>>> - Rename can’t move a field between types: if the TO name is
>>>> qualified, the statement validates that its prefix matches the
>>>> original field’s.
>>>> - I’m also changing ADD COLUMN to support adding fields to nested
>>>> columns by using qualifiedName instead of identifier.
>>>>
>>>> Please reply to this thread if you have suggestions based on a
>>>> different SQL engine or want this syntax to be different for
>>>> another reason. Thanks!
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
-- 
Ryan Blue
Software Engineer
Netflix
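As a small illustration of how portable the proposed forms are, the standard RENAME COLUMN and ADD COLUMN statements run unmodified even on SQLite, which is not discussed in the thread and is used here only because it ships with Python:

```python
# Sketch (not from the thread): the standard ALTER TABLE forms proposed
# above also work on SQLite. RENAME COLUMN needs SQLite >= 3.25;
# DROP COLUMN (not run here) needs >= 3.35, and SQLite does not support
# ALTER COLUMN ... TYPE at all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, a1 TEXT)")

# Standard rename: ALTER TABLE ... RENAME COLUMN old TO new
conn.execute("ALTER TABLE t RENAME COLUMN a1 TO b1")

# Standard add: ALTER TABLE ... ADD COLUMN name type
conn.execute("ALTER TABLE t ADD COLUMN c1 REAL")

# PRAGMA table_info returns (cid, name, type, ...) per column, in order.
cols = [row[1] for row in conn.execute("PRAGMA table_info(t)")]
print(cols)  # -> ['id', 'b1', 'c1']
```

The point is not SQLite support per se, but that the PostgreSQL-style statements are common enough that even a minimal embedded engine accepts them verbatim.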