[ 
https://issues.apache.org/jira/browse/IMPALA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-7927:
--------------------------------
    Description: 
A recent enhancement added a helpful feature: showing the rewritten SQL as part 
of the plan included in test cases such as PlannerTest. The rewritten SQL 
visualizes transforms that previously were hidden from view.

The current version can be made even more useful by addressing a few 
limitations.
h4. SELECT clause shows rewritten SQL, but HAVING does not

Due to nuances of the code, the SELECT statement maintains two copies of the 
HAVING clause: one before rewrites and alias substitutions, the other after. 
The toSQL() method always uses the before-rewrites version, even with the SQL 
options are set to show rewritten SQL. In this case, enhance the code to show 
the rewritten version instead.

Example of current version:

{code:sql}
HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS
BIGINT) = CAST(0 AS BIGINT)
{code}

Example of enhanced version:

{code:sql}
HAVING CAST(1048576 AS BIGINT) * zeroifnull(count(*)) %
CAST(2 AS BIGINT) = CAST(0 AS BIGINT)
{code}

Notice that the enhanced version shows the actual rewritten SQL, including the 
{{zeroifnull()}} method inserted by the analyzer.
h4. Show column references directly, not as expressions

The analyzer performs a pass over the HAVING (and other) clauses to replace 
expressions in HAVING with references to the same expression materialized in 
the result set.

At the same time, Impala creates labels for result set columns that show the 
expression:

{code:sql}
SELECT id + 2 FROM alltypestiny;
{code}
{noformat}
+--------+
| id + 2 |
+--------+
| 2      |
…
{noformat}

Combine these two, and the rewritten SQL shown in the plan is misleading: 
column references are replaced with the label, which is defined as the 
unrewitten SQL without implicit casts. Original query:

{code:sql}
select sum(1 + 1 + id)
from functional.alltypes
group by timestamp_col = cast('2015-11-15' as timestamp) + interval 1 year
having 1024 * 1024 * count(*) % 2 = 0
  and (sum(1 + 1 + id) > 1 or sum(1 + 1 + id) > 1)
  and (sum(1 + 1 + id) between 5 and 10)
{code}

Rewritten form shown in the test file:

{code:sql}
Analyzed query:
SELECT sum(CAST(2 AS BIGINT) + CAST(id AS BIGINT))
FROM functional.alltypes
GROUP BY timestamp_col = TIMESTAMP '2016-11-15 00:00:00'
HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS BIGINT) = CAST(0 AS
BIGINT)
AND sum(2 + id) > CAST(1 AS BIGINT)
AND sum(2 + id) >= CAST(5 AS BIGINT)
AND sum(2 + id) <= CAST(10 AS BIGINT)
{code}

Note that the original {{sum(1 + 1 + id)}} is shown rewritten in the {{SELECT}} 
clause. Note that references to that expression in {{HAVING}} have been 
rewritten and are shown with the label, which is the rewritten form, but 
without implicit casts: {{sum(2 + id)}}. Someone versed in the art will 
identify these as column references, but it is a bit hard for newbies.

Better would be to clearly show such references, using a Impala internal 
syntax, with the expression as a comment:

{code:sql}
HAVING CAST(1048576 AS BIGINT) * $ao$2 /* count(*) */ % CAST(2 AS BIGINT) =
CAST(0 AS BIGINT) 
AND $ao$1 /* sum(2 + id) */ > CAST(1 AS BIGINT) 
AND $ao$1 /*sum(2 + id) */ >= CAST(5 AS BIGINT) 
AND $ao$1 /* sum(2 + id) */ <= CAST(10 AS BIGINT)
{code}

Here, {{$ao$i}} refers to the aggregate output row. The two-letter form avoids 
conflict with the generated one-letter aliases used elsewhere in Impala.

Note that these changes apply *only* when using the option to produce rewritten 
SQL, which is used only when generating the “Analyzed Query” section of the 
test file. Users will never see this change.

  was:
A recent enhancement added a helpful feature: showing the rewritten SQL as part 
of the plan included in test cases such as PlannerTest. The rewritten SQL 
visualizes transforms that previously were hidden from view.

The current version can be made even more useful by addressing a few 
limitations.

h4. SELECT clause shows rewritten SQL, but HAVING does not

Due to nuances of the code, the SELECT statement maintains two copies of the 
HAVING clause: one before rewrites and alias substitutions, the other after. 
The toSQL() method always uses the before-rewrites version, even with the SQL 
options are set to show rewritten SQL. In this case, enhance the code to show 
the rewritten version instead.

Example of current version:

{code:sql}
HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS
BIGINT) = CAST(0 AS BIGINT)
{code}

Example of enhanced version:

{code:sql}
HAVING CAST(1048576 AS BIGINT) * zeroifnull(count(*)) %
CAST(2 AS BIGINT) = CAST(0 AS BIGINT)
{code}

Notice that the enhanced version shows the actual rewritten SQL, including the 
{{zeroifnull()}} method inserted by the analyzer.

h4. Show column references directly, not as expressions

The analyzer performs a pass over the HAVING (and other) clauses to replace 
expressions in HAVING with references to the same expression materialized in 
the result set.

At the same time, Impala creates labels for result set columns that show the 
expression:

{code:sql}
SELECT id + 2 FROM alltypestiny;
{code}

{noformat}
+--------+
| id + 2 |
+--------+
| 2      |
…
{noformat} 

Combine these two, and the rewritten SQL shown in the plan is misleading: 
column references are replaced with the label, which is defined as the 
unrewitten SQL without implicit casts. Original query:

{code:sql}
select sum(1 + 1 + id)
from functional.alltypes
group by timestamp_col = cast('2015-11-15' as timestamp) + interval 1 year
having 1024 * 1024 * count(*) % 2 = 0
  and (sum(1 + 1 + id) > 1 or sum(1 + 1 + id) > 1)
  and (sum(1 + 1 + id) between 5 and 10)
{code}

Rewritten form shown in the test file:

{code:sql}
Analyzed query:
SELECT sum(CAST(2 AS BIGINT) + CAST(id AS BIGINT))
FROM functional.alltypes
GROUP BY timestamp_col = TIMESTAMP '2016-11-15 00:00:00'
HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS BIGINT) = CAST(0 AS
BIGINT)
AND sum(2 + id) > CAST(1 AS BIGINT)
AND sum(2 + id) >= CAST(5 AS BIGINT)
AND sum(2 + id) <= CAST(10 AS BIGINT)
{code}

Note that the original {{sum(1 + 1 + id)}} is shown rewritten in the {{SELECT}} 
clause. Note that references to that expression in {{HAVING}} have been 
rewritten and are shown with the label, which is the rewritten form, but 
without implicit casts: {{sum(2 + id)}}. Someone versed in the art will 
identify these as column references, but it is a bit hard for newbies.

Better would be to clearly show such references, perhaps using a modified Hive 
syntax, with the expression as a comment:

{code:sql}
HAVING CAST(1048576 AS BIGINT) * _o2 /* count(*) */ % CAST(2 AS BIGINT) = CAST(0
AS BIGINT)
AND _o1 /* sum(2 + id) */ > CAST(1 AS BIGINT)
AND _o1 /* sum(2 + id) */ >= CAST(5 AS BIGINT)
AND _o1 /* sum(2 + id) */ <= CAST(10 AS BIGINT)
{code}

Here, {{_oi}} refers to the output row, leaving the Hive labels, {{_ci}} to 
refer to the select items.

Note that these changes apply *only* when using the option to produce rewritten 
SQL, which is used only when generating the “Analyzed Query” section of the 
test file. Users will never see this change. 


> Enhance Rewritten SQL in test files
> -----------------------------------
>
>                 Key: IMPALA-7927
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7927
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> A recent enhancement added a helpful feature: showing the rewritten SQL as 
> part of the plan included in test cases such as PlannerTest. The rewritten 
> SQL visualizes transforms that previously were hidden from view.
> The current version can be made even more useful by addressing a few 
> limitations.
> h4. SELECT clause shows rewritten SQL, but HAVING does not
> Due to nuances of the code, the SELECT statement maintains two copies of the 
> HAVING clause: one before rewrites and alias substitutions, the other after. 
> The toSQL() method always uses the before-rewrites version, even with the SQL 
> options are set to show rewritten SQL. In this case, enhance the code to show 
> the rewritten version instead.
> Example of current version:
> {code:sql}
> HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS
> BIGINT) = CAST(0 AS BIGINT)
> {code}
> Example of enhanced version:
> {code:sql}
> HAVING CAST(1048576 AS BIGINT) * zeroifnull(count(*)) %
> CAST(2 AS BIGINT) = CAST(0 AS BIGINT)
> {code}
> Notice that the enhanced version shows the actual rewritten SQL, including 
> the {{zeroifnull()}} method inserted by the analyzer.
> h4. Show column references directly, not as expressions
> The analyzer performs a pass over the HAVING (and other) clauses to replace 
> expressions in HAVING with references to the same expression materialized in 
> the result set.
> At the same time, Impala creates labels for result set columns that show the 
> expression:
> {code:sql}
> SELECT id + 2 FROM alltypestiny;
> {code}
> {noformat}
> +--------+
> | id + 2 |
> +--------+
> | 2      |
> …
> {noformat}
> Combine these two, and the rewritten SQL shown in the plan is misleading: 
> column references are replaced with the label, which is defined as the 
> unrewitten SQL without implicit casts. Original query:
> {code:sql}
> select sum(1 + 1 + id)
> from functional.alltypes
> group by timestamp_col = cast('2015-11-15' as timestamp) + interval 1 year
> having 1024 * 1024 * count(*) % 2 = 0
>   and (sum(1 + 1 + id) > 1 or sum(1 + 1 + id) > 1)
>   and (sum(1 + 1 + id) between 5 and 10)
> {code}
> Rewritten form shown in the test file:
> {code:sql}
> Analyzed query:
> SELECT sum(CAST(2 AS BIGINT) + CAST(id AS BIGINT))
> FROM functional.alltypes
> GROUP BY timestamp_col = TIMESTAMP '2016-11-15 00:00:00'
> HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS BIGINT) = CAST(0 AS
> BIGINT)
> AND sum(2 + id) > CAST(1 AS BIGINT)
> AND sum(2 + id) >= CAST(5 AS BIGINT)
> AND sum(2 + id) <= CAST(10 AS BIGINT)
> {code}
> Note that the original {{sum(1 + 1 + id)}} is shown rewritten in the 
> {{SELECT}} clause. Note that references to that expression in {{HAVING}} have 
> been rewritten and are shown with the label, which is the rewritten form, but 
> without implicit casts: {{sum(2 + id)}}. Someone versed in the art will 
> identify these as column references, but it is a bit hard for newbies.
> Better would be to clearly show such references, using a Impala internal 
> syntax, with the expression as a comment:
> {code:sql}
> HAVING CAST(1048576 AS BIGINT) * $ao$2 /* count(*) */ % CAST(2 AS BIGINT) =
> CAST(0 AS BIGINT) 
> AND $ao$1 /* sum(2 + id) */ > CAST(1 AS BIGINT) 
> AND $ao$1 /*sum(2 + id) */ >= CAST(5 AS BIGINT) 
> AND $ao$1 /* sum(2 + id) */ <= CAST(10 AS BIGINT)
> {code}
> Here, {{$ao$i}} refers to the aggregate output row. The two-letter form 
> avoids conflict with the generated one-letter aliases used elsewhere in 
> Impala.
> Note that these changes apply *only* when using the option to produce 
> rewritten SQL, which is used only when generating the “Analyzed Query” 
> section of the test file. Users will never see this change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to