Paul Rogers created IMPALA-7927:
-----------------------------------
Summary: Enhance Rewritten SQL in test files
Key: IMPALA-7927
URL: https://issues.apache.org/jira/browse/IMPALA-7927
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 3.0
Reporter: Paul Rogers
Assignee: Paul Rogers
A recent enhancement added a helpful feature: showing the rewritten SQL as part
of the plan included in test cases such as PlannerTest. The rewritten SQL
visualizes transforms that previously were hidden from view.
The current version can be made even more useful by addressing a few
limitations.
h4. SELECT clause shows rewritten SQL, but HAVING does not
Due to nuances of the code, the SELECT statement maintains two copies of the
HAVING clause: one before rewrites and alias substitutions, the other after.
The toSQL() method always uses the before-rewrites version, even with the SQL
options are set to show rewritten SQL. In this case, enhance the code to show
the rewritten version instead.
Example of current version:
{code:sql}
HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS
BIGINT) = CAST(0 AS BIGINT)
{code}
Example of enhanced version:
{code:sql}
HAVING CAST(1048576 AS BIGINT) * zeroifnull(count(*)) %
CAST(2 AS BIGINT) = CAST(0 AS BIGINT)
{code}
Notice that the enhanced version shows the actual rewritten SQL, including the
{{zeroifnull()}} method inserted by the analyzer.
h4. Show column references directly, not as expressions
The analyzer performs a pass over the HAVING (and other) clauses to replace
expressions in HAVING with references to the same expression materialized in
the result set.
At the same time, Impala creates labels for result set columns that show the
expression:
{code:sql}
SELECT id + 2 FROM alltypestiny;
{code}
{noformat}
+--------+
| id + 2 |
+--------+
| 2 |
…
{noformat}
Combine these two, and the rewritten SQL shown in the plan is misleading:
column references are replaced with the label, which is defined as the
unrewitten SQL without implicit casts. Original query:
{code:sql}
select sum(1 + 1 + id)
from functional.alltypes
group by timestamp_col = cast('2015-11-15' as timestamp) + interval 1 year
having 1024 * 1024 * count(*) % 2 = 0
and (sum(1 + 1 + id) > 1 or sum(1 + 1 + id) > 1)
and (sum(1 + 1 + id) between 5 and 10)
{code}
Rewritten form shown in the test file:
{code:sql}
Analyzed query:
SELECT sum(CAST(2 AS BIGINT) + CAST(id AS BIGINT))
FROM functional.alltypes
GROUP BY timestamp_col = TIMESTAMP '2016-11-15 00:00:00'
HAVING CAST(1048576 AS BIGINT) * count(*) % CAST(2 AS BIGINT) = CAST(0 AS
BIGINT)
AND sum(2 + id) > CAST(1 AS BIGINT)
AND sum(2 + id) >= CAST(5 AS BIGINT)
AND sum(2 + id) <= CAST(10 AS BIGINT)
{code}
Note that the original {{sum(1 + 1 + id)}} is shown rewritten in the {{SELECT}}
clause. Note that references to that expression in {{HAVING}} have been
rewritten and are shown with the label, which is the rewritten form, but
without implicit casts: {{sum(2 + id)}}. Someone versed in the art will
identify these as column references, but it is a bit hard for newbies.
Better would be to clearly show such references, perhaps using a modified Hive
syntax, with the expression as a comment:
{code:sql}
HAVING CAST(1048576 AS BIGINT) * _o2 /* count(*) */ % CAST(2 AS BIGINT) = CAST(0
AS BIGINT)
AND _o1 /* sum(2 + id) */ > CAST(1 AS BIGINT)
AND _o1 /* sum(2 + id) */ >= CAST(5 AS BIGINT)
AND _o1 /* sum(2 + id) */ <= CAST(10 AS BIGINT)
{code}
Here, {{_oi}} refers to the output row, leaving the Hive labels, {{_ci}} to
refer to the select items.
Note that these changes apply *only* when using the option to produce rewritten
SQL, which is used only when generating the “Analyzed Query” section of the
test file. Users will never see this change.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]