[ 
https://issues.apache.org/jira/browse/HIVE-29121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012280#comment-18012280
 ] 

Seonggon Namgung commented on HIVE-29121:
-----------------------------------------

[~zabetak], regarding LoptOptimizeJoinRule, I see two possible approaches to 
handling SemiJoin nodes within it:
* Refactor HiveSemiJoin to be a subclass of HiveJoin, so that 
JoinToMultiJoinRule can include it in a MultiJoin instance.
* Revive the SemiJoin pushdown rules, which are currently commented out in 
[CalcitePlanner|https://github.com/apache/hive/blob/a805bce08a85cf4e7e1daa9678e125e5f049cf5f/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L1924].

For the first approach, I do not know why HiveSemiJoin was originally kept 
separate from HiveJoin, so I cannot judge the feasibility or potential side 
effects of this change.

The second approach seems more appropriate to me. However, the previous 
implementation relied on Calcite's built-in rules, while the current Hive logic 
introduces Hive-specific SemiJoin pushdown rules. This divergence raises some 
concerns for me. Given your deeper understanding of Calcite's planner 
framework, I would appreciate your thoughts on whether either of these 
approaches would be feasible or advisable.

In the meantime, I chose to focus on HiveSubQueryRemoveRule because, though I 
could be mistaken, it seems that HIVE-24685 unintentionally dropped the 
original logic in HiveSubQRemoveRelBuilder. While the root cause of the 
performance issue we observed is suboptimal join ordering, I’d appreciate it if 
you could check the proposed change that restores the earlier logic as a 
targeted fix for the issue we're addressing.
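
To make the intended rewrite concrete, here is a minimal sketch in plain 
Python. It is not Hive or Calcite code, and every name in it is hypothetical. 
It only illustrates the relational identity the proposed change relies on, as 
I understand it: for an uncorrelated IN subquery, an inner join against the 
de-duplicated (aggregated) subquery output returns the same rows as a semi 
join, whereas an inner join without the aggregate would duplicate rows.

```python
# Illustration only: rows are tuples whose first field is the join key.
# None of these helpers exist in Hive or Calcite.

def semi_join(left, right_keys):
    """Keep each left row that has at least one matching key on the right."""
    keys = set(right_keys)
    return [row for row in left if row[0] in keys]


def inner_join(left, right_keys):
    """Plain inner join: one output row per matching (left, right) pair."""
    return [row for row in left for k in right_keys if row[0] == k]


def inner_join_on_aggregate(left, right_keys):
    """Inner join against the distinct keys (stands in for the aggregate)."""
    distinct_keys = list(dict.fromkeys(right_keys))  # order-preserving dedup
    return inner_join(left, distinct_keys)


left_rows = [(1, "a"), (2, "b"), (3, "c")]
subquery_keys = [1, 1, 3]  # the subquery output contains a duplicate key

semi = semi_join(left_rows, subquery_keys)
inner_agg = inner_join_on_aggregate(left_rows, subquery_keys)
inner_raw = inner_join(left_rows, subquery_keys)
```

With these inputs, semi and inner_agg hold the same rows, while inner_raw 
repeats the row with key 1. The inner-join form is the one that join 
reordering rules can handle like any other join.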

> Restore HiveSubQueryRemoveRule to use InnerJoin instead of SemiJoin for 
> uncorrelated IN/EXISTS subqueries with RelOptUtil.Logic.TRUE.
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29121
>                 URL: https://issues.apache.org/jira/browse/HIVE-29121
>             Project: Hive
>          Issue Type: Improvement
>         Environment: [^plan.example.txt]
>            Reporter: Seonggon Namgung
>            Assignee: Seonggon Namgung
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: plan.example.txt
>
>
> This JIRA is an addendum patch to HIVE-24685 and aims to restore the compiler 
> logic from HIVE-17767.
> During the substitution of HiveSubQRemoveRelBuilder with Calcite's RelBuilder 
> in HIVE-24685, Hive was changed to always use SemiJoin when handling 
> uncorrelated IN/EXISTS subqueries with logic == RelOptUtil.Logic.TRUE. 
> SemiJoin is intended for correlated IN/EXISTS subqueries in conjunction with 
> AGGR removal (cf. HIVE-17767). In the uncorrelated case it gains nothing from 
> AGGR removal and blocks rules that cannot handle HiveSemiJoin (e.g., join 
> reordering), so we should avoid it there.
> For clarity, query plans are attached. They show that HIVE-24685 introduces 
> a SemiJoin without removing HiveAggregate, unlike HIVE-17767.
> The attached plans cover the following combinations:
> * {Before HIVE-17767, After HIVE-17767, After HIVE-24685}
> * {Correlated, Uncorrelated}
> * {Before subquery removal, After subquery removal, After decorrelation}
> We discovered this issue while investigating a performance regression in 
> TPC-DS Query 23.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
