[
https://issues.apache.org/jira/browse/HIVE-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084125#comment-18084125
]
Stamatis Zampetakis commented on HIVE-28911:
--------------------------------------------
As far as I understand the main question is which expression pattern we prefer
among:
||ID||Expression||CNF||DNF||
|1|NOT (x IN (v1, v2, ..., vn))|YES|YES|
|2|NOT( (x=v1) OR (x=v2) OR ... OR (x=vn))|NO|NO|
|3|(x <> v1) AND (x <> v2) AND ... AND (x <> vn)|YES|NO|
|4|(x < v1) OR ((x > v1) AND (x < v2)) OR ... OR (x > vn) |NO|YES|
Since you mentioned something about implications of CNF expansion I included a
CNF/DNF characterization in the table.
I have the impression that the pattern E1 does not appear anywhere in Hive at
the moment neither at the logical (RelNode) plan nor at the physical (Operator)
plan. In addition, the transformation is somewhat the opposite of
[RexSimplify#simplifyNot|https://github.com/apache/calcite/blob/c1550dedec6693c01a99d1d2a4a83ff56f6711e4/core/src/main/java/org/apache/calcite/rex/RexSimplify.java#L925]
so re-introducing the NOT IN seems like we are "cancelling" that
simplification.
On the positive side it the pattern E1 has some nice properties since it is
both in CNF and DNF, it is compact, and simpler than all the rest, and possibly
the most efficient of all.
I am pretty sure we had some discussions about this transformation with
[~soumyakanti.das] in HIVE-27102 but cannot find traces of that discussions or
the respective commits. I was stupid enough to squash and force push a history
of over 100 commits before merging the PR for HIVE-27102 to master without
keeping a backup branch so some observations that we had at this point are
probably lost forever. If we are lucky maybe [~soumyakanti.das] will have some
archives/backups.
I am leaning towards pattern 1, although I would like to ensure that by going
into this direction we are not gonna miss simplifications or going into
infinite cycles of adding and removing NOT IN. Moreover, I would like to
understand a bit better why/where pattern 3 leads to CNF explosion and how come
pattern 4 is not affected.
> Improve SEARCH expansion to exploit <> operator
> ------------------------------------------------
>
> Key: HIVE-28911
> URL: https://issues.apache.org/jira/browse/HIVE-28911
> Project: Hive
> Issue Type: Improvement
> Components: CBO
> Reporter: Stamatis Zampetakis
> Assignee: Ruben Q L
> Priority: Major
> Labels: pull-request-available
>
> During various CBO transformations (especially during simplifications) the
> internal SEARCH (CALCITE-4173) operator is introduced in the plan. The SEARCH
> operator cannot be executed directly and must be expanded (using
> SearchTransformer) to an equivalent form for further processing.
> The SEARCH operator can be used to represent many types of range predicates
> including the inequality operator (<>).
> +Example+
> {code:sql}
> explain cbo
> select d_date_sk
> from date_dim
> where d_dom <> 10 and d_dom <> 20;
> {code}
> The intermediate plan before SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
> HiveFilter(condition=[SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)])])
> HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The two inequalities were converted to a Sarg with three ranges.
> The final plan after SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
> HiveFilter(condition=[OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20)))])
> HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The conversion to ranges/Sarg is useful cause it allows us the optimizer to
> perform much more powerful simplifications especially for complex predicates.
> However, the expanded expression for this simple range is sub-optimal.
> Ideally, the final filter condition after expansion should be the following:
> {noformat}
> AND(<>($9, 10), <>($9, 20))
> {noformat}
> The goal of this ticket is to be able to exploit the inequality operator when
> expanding ranges to generate simpler and slightly more efficient expressions.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)