[jira] [Updated] (IMPALA-7838) Planner's handling of parenthesized expressions is awkward

Paul Rogers (JIRA) Thu, 08 Nov 2018 12:42:50 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated IMPALA-7838:
--------------------------------
    Description: 
Consider two simple queries:

{code:sql}
SELECT 2 + 3 * 4 FROM ...
SELECT (2 + 3) * 4 FROM ...
{code}

Parentheses are required in infix notation to override the default precedence
rules. When parsed, the difference shows up as different tree structures:

{noformat}
(+ 2 (* 3 4))
(* (+ 2 3) 4)
{noformat}

Impala often needs to render the parse nodes back to SQL. A simple traversal
will produce {{2 + 3 * 4}} in both cases, which is wrong for the second query.
A fully parenthesized version would be correct, but a nuisance:

{noformat}
(2 + (3 * 4))
((2 + 3) * 4)
{noformat}
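
The two renderings can be sketched in a few lines of Java. This is only an illustration: {{SketchNode}}, {{SketchNum}} and {{SketchBinOp}} are hypothetical names invented here, not Impala's expression classes.

```java
// Hypothetical, minimal expression tree; these are NOT Impala's classes.
abstract class SketchNode {
    abstract String naive(); // infix rendering with no parentheses at all
    abstract String full();  // infix rendering with every operator node wrapped
}

class SketchNum extends SketchNode {
    final int value;
    SketchNum(int value) { this.value = value; }
    String naive() { return Integer.toString(value); }
    String full() { return Integer.toString(value); }
}

class SketchBinOp extends SketchNode {
    final String op;
    final SketchNode left, right;
    SketchBinOp(String op, SketchNode left, SketchNode right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }
    // Simple traversal: the tree shape is lost in the output.
    String naive() { return left.naive() + " " + op + " " + right.naive(); }
    // Fully parenthesized: always correct, but noisy.
    String full() { return "(" + left.full() + " " + op + " " + right.full() + ")"; }
}
```

With the trees {{(+ 2 (* 3 4))}} and {{(* (+ 2 3) 4)}}, {{naive()}} returns the same string for both, while {{full()}} returns the two fully parenthesized forms shown above.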

To recreate the user's original parenthesization, Impala tags each expression
node with whether it was wrapped in parentheses when parsed.

All of this works as long as we leave the parse tree unchanged. But if we do
rewrites, we end up in a confused state. For example, after constant folding,
should the above be enclosed in parentheses or not?

Take a more complex case:

{code:sql}
SELECT * FROM foo JOIN bar ON ... WHERE (foo.a > 10) AND (bar.b = 20)
{code}

Do we need to preserve the parentheses in the emitted plan? In the above, no,
we don't: they don't add information. Yet, the planner tries to preserve them:

{noformat}
    predicates: (foo.a > 10)
{noformat}

By contrast:

{code:sql}
SELECT * FROM foo JOIN bar ON ... WHERE foo.a > 10 AND bar.b = 20
{code}

Produces:

{noformat}
    predicates: foo.a > 10
{noformat}

Yet, functionally, the two are identical: the plan differences are unnecessary 
and are just noise.

Again, if a rewrite occurs, it is not clear whether the parentheses should or
should not be preserved.

Today, a rewrite discards parentheses (the rewritten node never has the
{{printSqlInParenthesis}} flag set). Yet that could, conceivably, change the
meaning of an expression when printed:

{noformat}
a AND (b OR c OR FALSE) --> a AND b OR c -- what the user sees
(a AND (b OR c OR FALSE)) --> (a AND (b OR c)) -- parse nodes
(a AND (b OR c OR FALSE)) --> ((a AND b) OR c) -- user's parse
{noformat}

The first line is what the user would see, the second is how the parse tree
represents the rewritten statement, and the third is how the user would
interpret the {{toSql()}} form with its missing parentheses.
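
A toy model makes the failure easy to reproduce. The sketch below is an assumption-laden illustration, not Impala's rewrite code: {{FlagExpr}}, {{printInParens}} and {{dropOrFalse}} are hypothetical names standing in for a flag-tagged node and a rule that removes {{OR FALSE}}.

```java
// Toy model only: FlagExpr and printInParens are hypothetical names,
// not Impala's classes or fields.
class FlagExpr {
    final String op;       // null for a leaf
    final String name;     // leaf text; null for an operator node
    final FlagExpr left, right;
    boolean printInParens; // set when the source text had ( ... ) around this node

    private FlagExpr(String op, String name, FlagExpr left, FlagExpr right) {
        this.op = op;
        this.name = name;
        this.left = left;
        this.right = right;
    }

    static FlagExpr leaf(String name) { return new FlagExpr(null, name, null, null); }
    static FlagExpr bin(String op, FlagExpr l, FlagExpr r) { return new FlagExpr(op, null, l, r); }
    FlagExpr parens() { printInParens = true; return this; }

    // Parentheses are printed only where the flag says the user typed them.
    String toSql() {
        String s = (op == null) ? name : left.toSql() + " " + op + " " + right.toSql();
        return printInParens ? "(" + s + ")" : s;
    }

    // Toy rewrite: "x OR FALSE" becomes "x". The flag that was set on the
    // replaced node is not carried over to its replacement.
    static FlagExpr dropOrFalse(FlagExpr e) {
        if (e.op == null) return e;
        FlagExpr l = dropOrFalse(e.left);
        FlagExpr r = dropOrFalse(e.right);
        if (e.op.equals("OR") && r.op == null && r.name.equals("FALSE")) {
            return l; // e.printInParens is lost here
        }
        if (l == e.left && r == e.right) return e;
        FlagExpr copy = bin(e.op, l, r);
        copy.printInParens = e.printInParens; // this node itself was not rewritten
        return copy;
    }
}
```

Building {{a AND (b OR c OR FALSE)}} with the flag on the outer {{OR}} and then calling {{dropOrFalse}} leaves the tree as {{AND(a, OR(b, c))}}, but {{toSql()}} prints {{a AND b OR c}}.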

The problem is that we are approaching this incorrectly. Since we perform
rewrites, we should emit parentheses only when needed to override precedence:

{noformat}
foo.a + foo.b * 3
(foo.a + foo.b) * 3
{noformat}

That is, parentheses should be generated based on precedence rules, *not* based
on the original SQL source text. That way, parentheses will be both consistent
and accurate in the emitted, rewritten expressions.
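
One possible shape for precedence-driven rendering is sketched below. It is a minimal illustration under stated assumptions: the class names ({{PExpr}}, {{PLeaf}}, {{PBinary}}) are invented here, the operator table covers only a handful of left-associative binary operators, and the numeric precedence levels are illustrative rather than taken from Impala's grammar.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Impala's classes. No per-node "was parenthesized" flag.
abstract class PExpr {
    abstract int precedence();
    abstract String toSql();
}

class PLeaf extends PExpr {
    final String text;
    PLeaf(String text) { this.text = text; }
    int precedence() { return Integer.MAX_VALUE; } // a leaf never needs wrapping
    String toSql() { return text; }
}

class PBinary extends PExpr {
    // Illustrative levels only: a higher number binds tighter.
    private static final Map<String, Integer> PREC = new HashMap<>();
    static {
        PREC.put("OR", 1);
        PREC.put("AND", 2);
        PREC.put("+", 3);
        PREC.put("-", 3);
        PREC.put("*", 4);
        PREC.put("/", 4);
    }

    final String op;
    final PExpr left, right;

    PBinary(String op, PExpr left, PExpr right) {
        if (!PREC.containsKey(op)) {
            throw new IllegalArgumentException("unknown operator: " + op);
        }
        this.op = op;
        this.left = left;
        this.right = right;
    }

    int precedence() { return PREC.get(op); }

    String toSql() {
        // Left operand: wrap only if it binds more loosely than this operator.
        // Right operand: also wrap on a tie, because the operators here are
        // treated as left-associative (a - (b - c) must keep its parentheses).
        return wrap(left, left.precedence() < precedence())
            + " " + op + " "
            + wrap(right, right.precedence() <= precedence());
    }

    private static String wrap(PExpr e, boolean parens) {
        String s = e.toSql();
        return parens ? "(" + s + ")" : s;
    }
}
```

Because the decision depends only on the tree, a rewritten {{AND(a, OR(b, c))}} renders as {{a AND (b OR c)}} regardless of what the user originally typed, and {{(foo.a > 10) AND (bar.b = 20)}}-style redundant parentheses would simply never be emitted once comparison operators are added to the table.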


> Planner's handling of parenthesized expressions is awkward
> ----------------------------------------------------------
>
>                 Key: IMPALA-7838
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7838
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Priority: Minor
>


