[
https://issues.apache.org/jira/browse/HIVE-24606?focusedWorklogId=536485&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-536485
]
ASF GitHub Bot logged work on HIVE-24606:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 15/Jan/21 13:46
Start Date: 15/Jan/21 13:46
Worklog Time Spent: 10m
Work Description: okumin opened a new pull request #1873:
URL: https://github.com/apache/hive/pull/1873
### What changes were proposed in this pull request?
Build correct dependencies among CTEs in order to prevent wrong results.
### Why are the changes needed?
A Hive query run with `hive.optimize.cte.materialize.threshold` enabled can
return wrong results when it contains complex CTEs.
The issue occurs when multi-stage CTEs depend on one another and
SemanticAnalyzer fails to link their tasks.
https://issues.apache.org/jira/browse/HIVE-24606
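As a rough illustration of what linking tasks by dependency means here, the sketch below models the CTEs from the example query in the HIVE-24606 description and derives, for each stage, the materialized CTEs it has to wait for. It is a toy model under stated assumptions, not the code in this patch: `CteDependencySketch`, `Cte`, `materializedPrerequisites`, and `exampleCtes` are hypothetical names, and the "root" entry is a stand-in for the final query.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these types are hypothetical and are NOT Hive's SemanticAnalyzer API.
final class CteDependencySketch {

    /** Hypothetical CTE descriptor: whether it is materialized and which CTEs it reads. */
    static final class Cte {
        final boolean materialized;
        final List<String> reads;

        Cte(boolean materialized, List<String> reads) {
            this.materialized = materialized;
            this.reads = reads;
        }
    }

    /**
     * Returns the materialized CTEs whose tasks must finish before {@code target} starts.
     * Non-materialized CTEs are treated as inlined, so the walk looks through them.
     */
    static Set<String> materializedPrerequisites(Map<String, Cte> ctes, String target) {
        Set<String> prerequisites = new TreeSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>(ctes.get(target).reads);
        while (!pending.isEmpty()) {
            String name = pending.pop();
            if (!visited.add(name)) {
                continue; // already handled via another path
            }
            Cte cte = ctes.get(name);
            if (cte.materialized) {
                prerequisites.add(name);   // a real stage: the target must wait for it
            } else {
                pending.addAll(cte.reads); // inlined: follow whatever it reads instead
            }
        }
        return prerequisites;
    }

    /** The CTEs of the example query, plus a pseudo-entry standing in for the root query. */
    static Map<String, Cte> exampleCtes() {
        Map<String, Cte> ctes = new HashMap<>();
        ctes.put("x", new Cte(false, Collections.<String>emptyList()));
        ctes.put("a1", new Cte(true, Collections.<String>emptyList()));
        ctes.put("a2", new Cte(true, Arrays.asList("a1")));
        ctes.put("root", new Cte(false, Arrays.asList("a1", "x", "a2")));
        return ctes;
    }

    public static void main(String[] args) {
        Map<String, Cte> ctes = exampleCtes();
        System.out.println("a2 waits for " + materializedPrerequisites(ctes, "a2"));
        System.out.println("root waits for " + materializedPrerequisites(ctes, "root"));
    }
}
```

Running `main` prints `a2 waits for [a1]` and `root waits for [a1, a2]`. In this model the edges come from what each CTE reads rather than from its position in a traversal, so the non-materialized `x` cannot hide the `a1` to `a2` dependency.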
### Does this PR introduce _any_ user-facing change?
No. Just a bug fix.
### How was this patch tested?
This PR adds three test cases. All of them fail on the latest
revision (`58552a0c6b42988efb5160b045a3bf985477f117`).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 536485)
Remaining Estimate: 0h
Time Spent: 10m
> Multi-stage materialized CTEs can lose intermediate data
> --------------------------------------------------------
>
> Key: HIVE-24606
> URL: https://issues.apache.org/jira/browse/HIVE-24606
> Project: Hive
> Issue Type: Bug
> Components: Query Planning
> Affects Versions: 2.3.7, 3.1.2, 4.0.0
> Reporter: okumin
> Assignee: okumin
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> With complex multi-stage CTEs, Hive can start a later stage before the
> preceding stage finishes.
> That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve the
> dependency between multi-stage materialized CTEs when a non-materialized CTE
> comes between them.
>
> [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414]
>
> For example, when submitting this query,
> {code:sql}
> SET hive.optimize.cte.materialize.threshold=2;
> SET hive.optimize.cte.materialize.full.aggregate.only=false;
> WITH x AS ( SELECT 'x' AS id ), -- not materialized
> a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
> a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
> SELECT * FROM a1
> UNION ALL
> SELECT * FROM x
> UNION ALL
> SELECT * FROM a2
> UNION ALL
> SELECT * FROM a2;
> {code}
> `toRealRootTasks` will traverse the CTEs in the order `a1`, `x`, `a2`. This
> means the dependency between `a1` and `a2` is ignored, and `a2` can start
> without waiting for `a1`. As a result, the above query returns the following
> result, which lacks the two rows expected from `a2`.
> {code:java}
> +-----+
> | id  |
> +-----+
> | a1  |
> | x   |
> +-----+
> {code}
> For reference, I ran this test against revision
> 425e1ff7c054f87c4db87e77d004282d529599ae.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)