Side comment: The current docs for CREATE TABLE <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description> add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" is actually part of the syntax <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>.
On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Jungtaek, it sounds like you consider the two rules to be separate syntaxes with their own consistency rules. For example, if I am using the Hive syntax rule, then the PARTITIONED BY clause adds new (partition) columns and requires types for those columns; if I'm using the Spark syntax rule with USING, then PARTITIONED BY must reference existing columns and cannot include types.
>
> I agree that this is confusing to users! We should fix it, but I don't think the right solution is to continue to have two rules with divergent syntax.
>
> This is confusing to users because they don't know anything about separate parser rules. All the user sees is that sometimes PARTITIONED BY requires types and sometimes it doesn't. Yes, we could add a keyword, HIVE, to signal that the syntax is borrowed from Hive for that case, but that actually breaks queries that run in Hive.
>
> I think the right solution is to unify the two syntaxes. I don't think they are so different that it isn't possible. Here are the differences I see:
>
> - Only in Hive:
>   - EXTERNAL
>   - skewSpec: SKEWED BY ...
>   - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>   - createFileFormat: STORED AS ...
> - Only in Spark:
>   - OPTIONS property list
> - Different syntax/interpretation:
>   - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>
> For the clauses that are supported in one but not the other, we can add them to a unified rule as optional clauses. The AST builder would then validate what makes sense or not (e.g., STORED AS with USING, or ROW FORMAT DELIMITED) and finally pass the remaining data on using CreateTableStatement. That statement would be handled like we do for the Spark rule today, but with extra metadata to pass along. This is also a step toward being able to put Hive behind the DSv2 API, because we'd be able to pass all of the Hive metadata clauses to the v2 catalog.
>
> The only difficult part is handling PARTITIONED BY. But in that case, we can use two different syntaxes from the same CREATE TABLE rule. If types are included, we use the Hive PARTITIONED BY syntax and convert in the AST builder to normalize to a single representation.
>
> What do you both think? This would make the behavior more clear and take a step toward getting rid of Hive-specific code.
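For concreteness, a minimal sketch of the PARTITIONED BY divergence described above (table and column names are made up for illustration):

    -- Hive syntax rule: PARTITIONED BY declares new partition columns, types required
    CREATE TABLE sales (id INT, amount DOUBLE)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;

    -- Spark syntax rule with USING: PARTITIONED BY references existing columns, no types
    CREATE TABLE sales (id INT, amount DOUBLE, dt STRING)
    USING parquet
    PARTITIONED BY (dt);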
> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> I'm trying to understand the reason you have been suggesting to keep the real thing unchanged but change the doc instead. Could you please elaborate why? End users will blame us when they hit a case where their query doesn't work as intended (1), find out it's undocumented (2), and find it hard to understand even from the Spark codebase (3).
>>
>> For me, addressing the root issue while adopting your suggestion would mean "dropping rule 2" and only supporting it with the legacy config on. We would tell end users: you need to enable the legacy config to leverage the Hive create table syntax, or just use beeline with Hive connected.
>>
>> But since we are even thinking about treating the native syntax as first class and dropping the Hive one implicitly (hiding it in the doc) or explicitly, does it really matter if we require a marker (like "HIVE") in rule 2 and isolate it? It would cause even less confusion than Spark 2.x, since we would require end users to provide the marker when creating a Hive table, which is easier to classify than "USING provider".
>>
>> If we think the native syntax would cover most of the cases where end users have been creating Hive tables in Spark (say, USING hive would simply work for them), I'm OK with dropping rule 2 and leading end users to enable the legacy config if really needed. If not, let's continue "fixing" the issue.
>>
>> (Another valid approach would be consolidating the two rules into one, and defining support for parameters per provider, e.g. EXTERNAL, STORED AS, ROW FORMAT, etc. are only supported by the Hive provider.)
>
>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> The fact that we have two CREATE TABLE syntaxes is already confusing many users. Shall we only document the native syntax? Then users don't need to worry about which rule their query fits, and they don't need to spend a lot of time understanding the subtle differences between these two syntaxes.
>>
>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> A bit of correction: the example I provided for "vice versa" is not really a correct case of vice versa. It's actually the same case (intended to use rule 2, which is not the default) but with a different result.
>>>
>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> My concern is that although we think of the change as simply making "USING provider" optional in rule 1, in reality the change most likely swaps the default rule for CREATE TABLE, which was rule 2 and is now rule 1 (it would be the happy case of the migration doc if the swap happened as intended), and there are still a couple of things which make a query fall into rule 2, which is non-trivial to reason about and not easy to explain.
>>>>>
>>>>> I only mentioned ROW FORMAT and STORED AS as cases that fall into rule 2 to simplify the problem statement, but they are not the only ones: using col_name1:col_type1 makes the query fall into rule 2 regardless of any other properties.
>>>>>
>>>>> Why does this matter? In Spark 2.x, if end users wanted to use rule 1 (which was not the default) and specify the parameters only available in rule 1, it clearly required "USING provider"; the parser would throw an error if there was any mistake in specifying parameters. End users could state clearly which rule the query should be bound to. If the query failed to bind to the rule as intended, it was simply denied.
>>>>>
>>>>> In Spark 3.x, the parser may not help in the case where end users intend to use rule 2 (which is not the default) but make some mistake in specifying parameters: the query could just "silently" bind to rule 1 and may even be executed without any error. The reverse also happens, but in an odd way, e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message that CREATE EXTERNAL TABLE is not supported.
>>>>>
>>>>> It's deterministic for end users only if they fully understand the difference between the two rules and also understand how Spark applies the rules to make a query fall into one of them. I'm not sure how we can improve only the documentation to make things clear, but if the approach would be explaining the difference between the rules and giving tips on how to make a query bind to a specific rule, the same could be applied to the parser rules to address the root cause.
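To illustrate the silent binding described above, a sketch (hypothetical table names; behavior as described in this thread):

    -- Binds to rule 2 (Hive) in 3.0 purely because of the optional colon
    -- between column name and type, regardless of any other clauses:
    CREATE TABLE t1 (a: INT);

    -- Without the colon, the same statement binds to rule 1 (native) and
    -- silently picks up spark.sql.sources.default as the provider:
    CREATE TABLE t2 (a INT);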
>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both the native and the Hive syntax. I'm fine with some changes to make it less confusing, as long as the user-facing behavior doesn't change. For example, define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config is false.
>>>>>>
>>>>>> I still don't get your point about the real problem for end users. There is no ambiguity, as the behavior is deterministic, although we rely on optional fields and rule order, which is bad. It's hard to document, but I don't think that's a big problem for end users.
>>>>>>
>>>>>> As for the legacy config, it does make the implementation more complicated, but it's invisible to most end users (we don't document it) and can be super useful to some users who want their queries to keep working in 3.0 without rewriting.
>>>>>>
>>>>>> If your only concern is documentation, I totally agree that we should improve it.
>>>>>
>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for sharing your view.
>>>>>>>
>>>>>>> I agree with you that it's good for Spark to promote Spark's own CREATE TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax unchanged; it's being called a "convenience", but I'm not sure I can agree with that.
>>>>>>>
>>>>>>> I'll quote my comments from SPARK-31136 here again to make the problem statement clearer:
>>>>>>>
>>>>>>> I think the parser implementation around CREATE TABLE brings ambiguity which is not documented anywhere. It wasn't ambiguous before because we forced users to specify USING provider if it wasn't a Hive table. Now it's either the default provider or Hive according to which options are provided, which seems to be non-trivial to reason about. (End users would never know, as it comes entirely from the parser rules.)
>>>>>>>
>>>>>>> I feel this is an issue of "not breaking old behavior". The parser rules get pretty complicated due to supporting the legacy config. Not breaking anything will make us get stuck eventually.
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>
>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the value of spark.sql.sources.default as its provider. In Spark version 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>
>>>>>>> This is not true if "ROW FORMAT" / "STORED AS" are provided, and we didn't describe anything about that.
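A sketch of that gotcha (illustrative statements; behavior as described in this thread):

    -- Documented: with the legacy config off (the 3.0 default), this uses
    -- spark.sql.sources.default as the provider:
    CREATE TABLE t1 (a INT);

    -- Documented way to restore the 2.4 behavior for the statement above:
    SET spark.sql.legacy.createHiveTableByDefault.enabled=true;

    -- Undocumented: this creates a Hive table even with the config off,
    -- because STORED AS routes the statement to the Hive rule:
    CREATE TABLE t2 (a INT) STORED AS PARQUET;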
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>
>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>   [ USING data_source ]
>>>>>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>     INTO num_buckets BUCKETS ]
>>>>>>>   [ LOCATION path ]
>>>>>>>   [ COMMENT table_comment ]
>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>   [ AS select_statement ]
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>
>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>   [ COMMENT table_comment ]
>>>>>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>     | ( col_name1, col_name2, ... ) ]
>>>>>>>   [ ROW FORMAT row_format ]
>>>>>>>   [ STORED AS file_format ]
>>>>>>>   [ LOCATION path ]
>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>   [ AS select_statement ]
>>>>>>>
>>>>>>> At least we should describe that the parser will try to match the first case (CREATE TABLE ~ USING data source) and fall back to the second case. Even if we describe this, it's not intuitive to reason about which rule a DDL query will fall into. As I commented earlier, "ROW FORMAT" and "STORED AS" are the options which make a DDL query fall into the second case, but they're described as "optional", so it's hard to catch the gotcha.
>>>>>>>
>>>>>>> Furthermore, while we document the syntax as above, in reality we allow "EXTERNAL" in the first rule (and throw an error), which ends up breaking the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even if we add "USING hive", the parser will throw an error; it now requires "ROW FORMAT" or "STORED AS".
>>>>>>>
>>>>>>> Simply put, do we really think end users should stop and try to match their query against the parser rules (or the order in which we explain them in the doc) by themselves to understand which provider the table will use? I'm sorry, but I think we are making a bad assumption about end users, which is a serious problem.
>>>>>>>
>>>>>>> If we really want to promote Spark's own CREATE TABLE, then would it really matter to treat the Hive CREATE TABLE as the "exceptional" one and isolate the two from each other? What's the point of providing a legacy config to go back to the old behavior if we fear breaking something while making it better or clearer? We do think the table provider is important (hence the change was made), so is it still a trivial problem that the provider is affected by specifying "optional" fields?
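A sketch of the EXTERNAL breakage described above (illustrative paths; behavior as reported in this thread and SPARK-30436):

    -- Worked in 2.4 (Hive rule); in 3.0 it matches the native rule first
    -- and is rejected with a message that CREATE EXTERNAL TABLE is not supported:
    CREATE EXTERNAL TABLE t1 (a INT) LOCATION '/tmp/t1';

    -- Adding USING hive does not help; the parser still throws an error:
    CREATE EXTERNAL TABLE t2 (a INT) USING hive LOCATION '/tmp/t2';

    -- Only ROW FORMAT or STORED AS routes the statement to the Hive rule:
    CREATE EXTERNAL TABLE t3 (a INT) STORED AS PARQUET LOCATION '/tmp/t3';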
>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive, because the native syntax required the USING clause while the Hive syntax made the ROW FORMAT or STORED AS clause optional.
>>>>>>>>
>>>>>>>> It's a good move to make the USING clause optional, which makes it easier to write the native CREATE TABLE syntax. Unfortunately, it leads to some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious problem here. If a user just writes CREATE TABLE without USING or ROW FORMAT or STORED AS, does it matter what table we create? Internally the parser rules conflict and we pick the native syntax depending on the rule order, but the user-facing behavior looks fine.
>>>>>>>>
>>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then CREATE EXTERNAL TABLE creates a Hive table, like in 2.4.
>>>>>>>>
>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi devs,
>>>>>>>>>
>>>>>>>>> I'd like to initiate a discussion and hear your voices on resolving the ambiguous parser rules between the two "create table"s brought by SPARK-30098 [1].
>>>>>>>>>
>>>>>>>>> Previously, the "create table" parser rules were clearly distinguished via "USING provider", which was very intuitive and deterministic: a DDL query creates a "Hive" table unless "USING provider" is specified. (Please refer to the parser rules in branch-2.4 [2].)
>>>>>>>>>
>>>>>>>>> After SPARK-30098, the "create table" parser rules became ambiguous (please refer to the parser rules in branch-3.0 [3]): the factors differentiating the two rules are only "ROW FORMAT" and "STORED AS", which are both defined as "optional". Now it relies on the "order" of the parser rules, which end users have no way to reason about and which is very unintuitive.
>>>>>>>>>
>>>>>>>>> Furthermore, the undocumented rule for EXTERNAL (added to the first rule to provide a better message) brought more confusion (I've described the broken existing query in SPARK-30436 [4]).
>>>>>>>>>
>>>>>>>>> Personally, I'd like to see the two rules made mutually exclusive, instead of trying to document the difference and telling end users to be careful about their queries. I see two ways to make the rules mutually exclusive:
>>>>>>>>>
>>>>>>>>> 1. Add some identifier to the create Hive table rule, like `CREATE ... "HIVE" TABLE ...`.
>>>>>>>>>
>>>>>>>>> pros: This is the simplest way to distinguish between the two rules.
>>>>>>>>> cons: This would lead end users to change their query if they intend to create a Hive table. (Given we will also provide a legacy option, I feel this is acceptable.)
>>>>>>>>>
>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>
>>>>>>>>> pros: Less invasive for existing queries.
>>>>>>>>> cons: Less intuitive, because these clauses have been optional and would now become mandatory to fall into the second rule.
>>>>>>>>>
>>>>>>>>> I would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>> 2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> 3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>
> --
> Ryan Blue
> Software Engineer
> Netflix
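For reference, minimal sketches of the two options proposed above (hypothetical syntax and statements; neither form is implemented):

    -- Option 1: an explicit marker isolating the Hive rule
    CREATE HIVE TABLE t1 (a INT) STORED AS ORC;

    -- Option 2: ROW FORMAT / STORED AS made mandatory for the Hive rule,
    -- so a bare CREATE TABLE can only match the native rule:
    CREATE TABLE t2 (a INT);                -- native rule, unambiguously
    CREATE TABLE t3 (a INT) STORED AS ORC;  -- Hive rule, unambiguously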