Side comment: The current docs for CREATE TABLE <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description> add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" is actually part of the syntax <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>.
On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Jungtaek, it sounds like you consider the two rules to be separate syntaxes with their own consistency rules. For example, if I am using the Hive syntax rule, then the PARTITIONED BY clause adds new (partition) columns and requires types for those columns; if I'm using the Spark syntax rule with USING, then PARTITIONED BY must reference existing columns and cannot include types.
>
> I agree that this is confusing to users! We should fix it, but I don't think the right solution is to continue to have two rules with divergent syntax.
>
> This is confusing to users because they don't know anything about separate parser rules. All the user sees is that sometimes PARTITIONED BY requires types and sometimes it doesn't. Yes, we could add a keyword, HIVE, to signal that the syntax is borrowed from Hive for that case, but that actually breaks queries that run in Hive.
>
> I think the right solution is to unify the two syntaxes. I don't think they are so different that it isn't possible. Here are the differences I see:
>
> - Only in Hive:
>   - EXTERNAL
>   - skewSpec: SKEWED BY ...
>   - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>   - createFileFormat: STORED AS ...
> - Only in Spark:
>   - OPTIONS property list
> - Different syntax/interpretation:
>   - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>
> For the clauses that are supported in one but not the other, we can add them to a unified rule as optional clauses. The AST builder would then validate what makes sense or not (e.g., STORED AS with USING, or ROW FORMAT DELIMITED) and finally pass the remaining data on using CreateTableStatement. That statement would be handled like we do for the Spark rule today, but with extra metadata to pass along. This is also a step toward being able to put Hive behind the DSv2 API, because we'd be able to pass all of the Hive metadata clauses to the v2 catalog.
>
> The only difficult part is handling PARTITIONED BY. But in that case, we can use two different syntaxes from the same CREATE TABLE rule. If types are included, we use the Hive PARTITIONED BY syntax and convert in the AST builder to normalize to a single representation.
>
> What do you both think? This would make the behavior more clear and take a step toward getting rid of Hive-specific code.
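For concreteness, a minimal sketch of the PARTITIONED BY divergence described above (table and column names are made up for illustration):

    -- Hive syntax rule: PARTITIONED BY declares new partition columns, types required
    CREATE TABLE sales (id INT, amount DOUBLE)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;

    -- Spark syntax rule with USING: PARTITIONED BY references existing columns, no types
    CREATE TABLE sales (id INT, amount DOUBLE, dt STRING)
    USING parquet
    PARTITIONED BY (dt);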
> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> I'm trying to understand the reason you have been suggesting to keep the real thing unchanged but change the doc instead. Could you please elaborate why? End users will blame us when they hit a case where their query doesn't work as intended (1), find out it's undocumented (2), and find it hard to understand even from the Spark codebase (3).
>>
>> For me, addressing the root issue while adopting your suggestion would mean "dropping rule 2" and only supporting it with the legacy config on. We would tell end users: you need to enable the legacy config to leverage the Hive create table syntax, or just use beeline with Hive connected.
>>
>> But since we are even thinking about treating the native syntax as first class and dropping the Hive one implicitly (hiding it in the doc) or explicitly, does it really matter if we require a marker (like "HIVE") in rule 2 and isolate it? It would cause even less confusion than Spark 2.x, since we would require end users to provide the marker when creating a Hive table, which is easier to classify than "USING provider".
>>
>> If we think the native syntax would cover most of the cases where end users have been creating Hive tables in Spark (say, USING hive would simply work for them), I'm OK with dropping rule 2 and leading end users to enable the legacy config if really needed. If not, let's continue "fixing" the issue.
>>
>> (Another valid approach would be consolidating the two rules into one, and defining support for parameters per provider, e.g. EXTERNAL, STORED AS, ROW FORMAT, etc. are only supported by the Hive provider.)
>
>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> The fact that we have two CREATE TABLE syntaxes is already confusing many users. Shall we only document the native syntax? Then users don't need to worry about which rule their query fits, and they don't need to spend a lot of time understanding the subtle differences between these two syntaxes.
>>
>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> A bit of correction: the example I provided for "vice versa" is not really a correct case of vice versa. It's actually the same case (intended to use rule 2, which is not the default) but with a different result.
>>>
>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> My concern is that although we think of the change as simply making "USING provider" optional in rule 1, in reality the change most likely swaps the default rule for CREATE TABLE, which was rule 2 and is now rule 1 (it would be the happy case of the migration doc if the swap happened as intended), and there are still a couple of things which make a query fall into rule 2, which is non-trivial to reason about and not easy to explain.
>>>>>
>>>>> I only mentioned ROW FORMAT and STORED AS as cases that fall into rule 2 to simplify the problem statement, but they are not the only ones: using col_name1:col_type1 makes the query fall into rule 2 regardless of any other properties.
>>>>>
>>>>> Why does this matter? In Spark 2.x, if end users wanted to use rule 1 (which was not the default) and specify the parameters only available in rule 1, it clearly required "USING provider"; the parser would throw an error if there was any mistake in specifying parameters. End users could state clearly which rule the query should be bound to. If the query failed to bind to the rule as intended, it was simply denied.
>>>>>
>>>>> In Spark 3.x, the parser may not help in the case where end users intend to use rule 2 (which is not the default) but make some mistake in specifying parameters: the query could just "silently" bind to rule 1 and may even be executed without any error. The reverse also happens, but in an odd way, e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message that CREATE EXTERNAL TABLE is not supported.
>>>>>
>>>>> It's deterministic for end users only if they fully understand the difference between the two rules and also understand how Spark applies the rules to make a query fall into one of them. I'm not sure how we can improve only the documentation to make things clear, but if the approach would be explaining the difference between the rules and giving tips on how to make a query bind to a specific rule, the same could be applied to the parser rules to address the root cause.
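To illustrate the silent binding described above, a sketch (hypothetical table names; behavior as described in this thread):

    -- Binds to rule 2 (Hive) in 3.0 purely because of the optional colon
    -- between column name and type, regardless of any other clauses:
    CREATE TABLE t1 (a: INT);

    -- Without the colon, the same statement binds to rule 1 (native) and
    -- silently picks up spark.sql.sources.default as the provider:
    CREATE TABLE t2 (a INT);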
>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both the native and the Hive syntax. I'm fine with some changes to make it less confusing, as long as the user-facing behavior doesn't change. For example, define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config is false.
>>>>>>
>>>>>> I still don't get your point about the real problem for end users. There is no ambiguity, as the behavior is deterministic, although we rely on optional fields and rule order, which is bad. It's hard to document, but I don't think that's a big problem for end users.
>>>>>>
>>>>>> As for the legacy config, it does make the implementation more complicated, but it's invisible to most end users (we don't document it) and can be super useful to some users who want their queries to keep working in 3.0 without rewriting.
>>>>>>
>>>>>> If your only concern is documentation, I totally agree that we should improve it.
>>>>>
>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for sharing your view.
>>>>>>>
>>>>>>> I agree with you that it's good for Spark to promote Spark's own CREATE TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax unchanged; it's being called a "convenience", but I'm not sure I can agree with that.
>>>>>>>
>>>>>>> I'll quote my comments from SPARK-31136 here again to make the problem statement clearer:
>>>>>>>
>>>>>>> I think the parser implementation around CREATE TABLE brings ambiguity which is not documented anywhere. It wasn't ambiguous before because we forced users to specify USING provider if it wasn't a Hive table. Now it's either the default provider or Hive according to which options are provided, which seems to be non-trivial to reason about. (End users would never know, as it comes entirely from the parser rules.)
>>>>>>>
>>>>>>> I feel this is an issue of "not breaking old behavior". The parser rules get pretty complicated due to supporting the legacy config. Not breaking anything will make us get stuck eventually.
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>
>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the value of spark.sql.sources.default as its provider. In Spark version 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>
>>>>>>> This is not true if "ROW FORMAT" / "STORED AS" are provided, and we didn't describe anything about that.
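A sketch of that gotcha (illustrative statements; behavior as described in this thread):

    -- Documented: with the legacy config off (the 3.0 default), this uses
    -- spark.sql.sources.default as the provider:
    CREATE TABLE t1 (a INT);

    -- Documented way to restore the 2.4 behavior for the statement above:
    SET spark.sql.legacy.createHiveTableByDefault.enabled=true;

    -- Undocumented: this creates a Hive table even with the config off,
    -- because STORED AS routes the statement to the Hive rule:
    CREATE TABLE t2 (a INT) STORED AS PARQUET;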
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>
>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>   [ USING data_source ]
>>>>>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>     INTO num_buckets BUCKETS ]
>>>>>>>   [ LOCATION path ]
>>>>>>>   [ COMMENT table_comment ]
>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>   [ AS select_statement ]
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>
>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>   [ COMMENT table_comment ]
>>>>>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>     | ( col_name1, col_name2, ... ) ]
>>>>>>>   [ ROW FORMAT row_format ]
>>>>>>>   [ STORED AS file_format ]
>>>>>>>   [ LOCATION path ]
>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>   [ AS select_statement ]
>>>>>>>
>>>>>>> At least we should describe that the parser will try to match the first case (CREATE TABLE ~ USING data source) and fall back to the second case. Even if we describe this, it's not intuitive to reason about which rule a DDL query will fall into. As I commented earlier, "ROW FORMAT" and "STORED AS" are the options which make a DDL query fall into the second case, but they're described as "optional", so it's hard to catch the gotcha.
>>>>>>>
>>>>>>> Furthermore, while we document the syntax as above, in reality we allow "EXTERNAL" in the first rule (and throw an error), which ends up breaking the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even if we add "USING hive", the parser will throw an error; it now requires "ROW FORMAT" or "STORED AS".
>>>>>>>
>>>>>>> Simply put, do we really think end users should stop and try to match their query against the parser rules (or the order in which we explain them in the doc) by themselves to understand which provider the table will use? I'm sorry, but I think we are making a bad assumption about end users, which is a serious problem.
>>>>>>>
>>>>>>> If we really want to promote Spark's own CREATE TABLE, then would it really matter to treat the Hive CREATE TABLE as the "exceptional" one and isolate the two from each other? What's the point of providing a legacy config to go back to the old behavior if we fear breaking something while making it better or clearer? We do think the table provider is important (hence the change was made), so is it still a trivial problem that the provider is affected by specifying "optional" fields?
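A sketch of the EXTERNAL breakage described above (illustrative paths; behavior as reported in this thread and SPARK-30436):

    -- Worked in 2.4 (Hive rule); in 3.0 it matches the native rule first
    -- and is rejected with a message that CREATE EXTERNAL TABLE is not supported:
    CREATE EXTERNAL TABLE t1 (a INT) LOCATION '/tmp/t1';

    -- Adding USING hive does not help; the parser still throws an error:
    CREATE EXTERNAL TABLE t2 (a INT) USING hive LOCATION '/tmp/t2';

    -- Only ROW FORMAT or STORED AS routes the statement to the Hive rule:
    CREATE EXTERNAL TABLE t3 (a INT) STORED AS PARQUET LOCATION '/tmp/t3';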
>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive, because the native syntax required the USING clause while the Hive syntax made the ROW FORMAT or STORED AS clause optional.
>>>>>>>>
>>>>>>>> It's a good move to make the USING clause optional, which makes it easier to write the native CREATE TABLE syntax. Unfortunately, it leads to some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious problem here. If a user just writes CREATE TABLE without USING or ROW FORMAT or STORED AS, does it matter what table we create? Internally the parser rules conflict and we pick the native syntax depending on the rule order, but the user-facing behavior looks fine.
>>>>>>>>
>>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then CREATE EXTERNAL TABLE creates a Hive table, like in 2.4.
>>>>>>>>
>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi devs,
>>>>>>>>>
>>>>>>>>> I'd like to initiate a discussion and hear your voices on resolving the ambiguous parser rules between the two "create table"s brought by SPARK-30098 [1].
>>>>>>>>>
>>>>>>>>> Previously, the "create table" parser rules were clearly distinguished via "USING provider", which was very intuitive and deterministic: a DDL query creates a "Hive" table unless "USING provider" is specified. (Please refer to the parser rules in branch-2.4 [2].)
>>>>>>>>>
>>>>>>>>> After SPARK-30098, the "create table" parser rules became ambiguous (please refer to the parser rules in branch-3.0 [3]): the factors differentiating the two rules are only "ROW FORMAT" and "STORED AS", which are both defined as "optional". Now it relies on the "order" of the parser rules, which end users have no way to reason about and which is very unintuitive.
>>>>>>>>>
>>>>>>>>> Furthermore, the undocumented rule for EXTERNAL (added to the first rule to provide a better message) brought more confusion (I've described the broken existing query in SPARK-30436 [4]).
>>>>>>>>>
>>>>>>>>> Personally, I'd like to see the two rules made mutually exclusive, instead of trying to document the difference and telling end users to be careful about their queries. I see two ways to make the rules mutually exclusive:
>>>>>>>>>
>>>>>>>>> 1. Add some identifier to the create Hive table rule, like `CREATE ... "HIVE" TABLE ...`.
>>>>>>>>>
>>>>>>>>> pros: This is the simplest way to distinguish between the two rules.
>>>>>>>>> cons: This would lead end users to change their query if they intend to create a Hive table. (Given we will also provide a legacy option, I feel this is acceptable.)
>>>>>>>>>
>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>
>>>>>>>>> pros: Less invasive for existing queries.
>>>>>>>>> cons: Less intuitive, because these clauses have been optional and would now become mandatory to fall into the second rule.
>>>>>>>>>
>>>>>>>>> I would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>> 2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> 3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>
> --
> Ryan Blue
> Software Engineer
> Netflix
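For reference, minimal sketches of the two options proposed above (hypothetical syntax and statements; neither form is implemented):

    -- Option 1: an explicit marker isolating the Hive rule
    CREATE HIVE TABLE t1 (a INT) STORED AS ORC;

    -- Option 2: ROW FORMAT / STORED AS made mandatory for the Hive rule,
    -- so a bare CREATE TABLE can only match the native rule:
    CREATE TABLE t2 (a INT);                -- native rule, unambiguously
    CREATE TABLE t3 (a INT) STORED AS ORC;  -- Hive rule, unambiguously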