[ 
https://issues.apache.org/jira/browse/HIVE-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085487#comment-18085487
 ] 

Ruben Q L commented on HIVE-28911:
----------------------------------

[~zabetak], [~soumyakanti.das]

I think what's happening is that, during HivePartitionPruneRule, we process a 
RexNode (which must have been already simplified, flattened, etc by Calcite) 
containing an AND with a SEARCH: {{AND( SEARCH(b, Sarg[(-∞..'four'), 
('four'..'one'), ('one'..+∞)]), <>(a, 22) )}}
(this is just a part, in the total filer this expression is combined with ... 
{{OR ((b) IN ('four', 'one') or (a = 22)) is null)}} )
This AND RexNode gets processed by ExprNodeConverter to convert it into the 
equivalent ExprNodeDesc, and at that point the SEARCH gets expanded by 
SearchTransformer#transform, with the proposed patch into a conjunction of 
NOT_EQUALS, i.e. {{AND(<>(b, 'four'), <>(b,'one'))}} ; which gives us the 
global, *unflattened* expression: {{AND( AND(<>(b, 'four'), <>(b,'one')), <>(a, 
22) )}} ; so this is converted into an unflattened ExprNodeDesc.
So we reach a partition key with an unflattened AND, and this is stored as 
partitionList in the corresponding RelOptHiveTable; therefore, during 
RelOptHiveTable#updateColStats, the colStatsCache that is built, uses the 
unflattened expression as key; and this is stored in the CompilationOpContext.

However, during the SematicAnalyzer process, we apply the following rules:
{code:java}
public static final RelOptRule PROJECT_SEARCH_EXPAND =
      new HiveRexShuttleTransformRule.Config().withRexShuttle(x -> new 
SearchTransformer.Shuttle(x, UNKNOWN))
          ...
  public static final RelOptRule FILTER_SEARCH_EXPAND =
      new HiveRexShuttleTransformRule.Config().withRexShuttle(x -> new 
SearchTransformer.Shuttle(x, FALSE))
          ....
  public static final RelOptRule JOIN_SEARCH_EXPAND =
      new HiveRexShuttleTransformRule.Config().withRexShuttle(x -> new 
SearchTransformer.Shuttle(x, FALSE))
          ...;
{code}
which apply SearchTransformer.Shuttle on a Project/Filter/Join RexNode (which 
may or may not contain SEARCH), to expand any potential SEARCH on it. But, this 
shuttle *also flattens AND/ORs* :
{code:java}
switch (call.getKind()) {
      // Flatten AND/OR operands.
      case OR:
        clonedOperands = visitList(call.operands, update);
        if (update[0]) {
          return RexUtil.composeDisjunction(rexBuilder, clonedOperands); // 
flattens
        } else {
          return call;
        }
      case AND:
        clonedOperands = visitList(call.operands, update);
        if (update[0]) {
          return RexUtil.composeConjunction(rexBuilder, clonedOperands); // 
flattens
        } else {
          return call;
        }
      case SEARCH:
        return new SearchTransformer<>(rexBuilder, call, 
this.unknownContext).transform();
{code}
I think so far we didn't have nested AND/ORs on search expansion, but with the 
NOT_EQUALS change, we may have them now, so this originates the mismatch. 
Afterwards, the PartitionPruner#transform returns a ParseContext with a 
opToPartPruner map containing the flattened expression.

Finally, in StatsRulesProcFactory#TableScanStatsRule#process, when trying to 
get the column stats for the partition, the partition key mismatch arises, so 
stats are not found (NONE).

Since we cannot remove the AND/OR flattening in SearchTransformer#Shuttle 
(otherwise we'd get an "AssertionError: RexUtil.isFlat should be true for 
condition" when trying to create a HiveFilter during CBO), the only solution to 
keep consistency is that all calls to SearchTransformer#transform must 
guarantee to flatten AND/ORs. This is already the case in 
SearchTransformer#Shuttle (as seen above), and 
HiveInBetweenExpandRule#RexInBetweenExpander (which contains similar code); but 
{*}NOT on ExprNodeConverter{*}, which leads to this problem.

Thus, I guess we must guarantee that [ExprNodeConverter#visitCall also 
flattens|github.com/apache/hive/pull/6503/commits/26351398c1a8f7537901293970decddf2275c3f7]
 (at least the ANDs for this particular problem, but probably the ORs too to be 
on the safe side) after search expansion.

> Improve SEARCH expansion to exploit <> operator 
> ------------------------------------------------
>
>                 Key: HIVE-28911
>                 URL: https://issues.apache.org/jira/browse/HIVE-28911
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>
> During various CBO transformations (especially during simplifications) the 
> internal SEARCH (CALCITE-4173) operator is introduced in the plan. The SEARCH 
> operator cannot be executed directly and must be expanded (using 
> SearchTransformer) to an equivalent form for further processing.
> The SEARCH operator can be used to represent many types of range predicates 
> including the inequality operator (<>).
> +Example+
> {code:sql}
> explain cbo
> select d_date_sk
> from date_dim
> where d_dom <> 10 and d_dom <> 20;
> {code}
> The intermediate plan before SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
>   HiveFilter(condition=[SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)])])
>     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The two inequalities were converted to a Sarg with three ranges.
> The final plan after SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
>   HiveFilter(condition=[OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20)))])
>     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The conversion to ranges/Sarg is useful cause it allows us the optimizer to 
> perform much more powerful simplifications especially for complex predicates. 
> However, the expanded expression for this simple range is sub-optimal.
> Ideally, the final filter condition after expansion should be the following:
> {noformat}
> AND(<>($9, 10), <>($9, 20))
> {noformat}
> The goal of this ticket is to be able to exploit the inequality operator when 
> expanding ranges to generate simpler and slightly more efficient expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to