[jira] [Commented] (HIVE-28911) Improve SEARCH expansion to exploit <> operator

Ruben Q L (Jira) Wed, 27 May 2026 08:45:07 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083886#comment-18083886
 ]


Ruben Q L commented on HIVE-28911:
----------------------------------

I started to work on this task [here|https://github.com/apache/hive/pull/6503], 
and I observed a strange side effect in one test (TestIcebergCliDriver with 
update_iceberg_copy_on_write_partitioned.q): certain Column stats changed from 
"PARTIAL" to "NONE" (see test result 
[here|https://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-6503/2/tests/],
 and the adjustment 
[here|https://github.com/apache/hive/pull/6503/changes/952a039e15c7dfbbd60c3e4242212138f8e0db1e#diff-1a7cc0d8f098e94db572d11e5ffa727c0476c6eda9cfa262899ea3426d69bf5b]).

Apparently, the reason for that might be that the "new" expression 
{code}
(b <> 'four') and (b <> 'one')
{code}
combined with the rest of the filter expression:
{code}
filterExpr: (((b) IN ('four', 'one') or (a = 22)) is null or ((b <> 'four') and 
(b <> 'one') and (a <> 22))) (type: boolean)
{code}
cannot be pushed down, or leads to a "CNF expansion explosion" when doing the 
predicate push-down for Iceberg partition processing.

Whereas the "old" expression, didn't have such problem:
{code}
(b < 'four') or ((b > 'four') and (b < 'one')) or (b > 'one')
{code}
{code}
filterExpr: (((b) IN ('four', 'one') or (a = 22)) is null or (((b < 'four') or 
((b > 'four') and (b < 'one')) or (b > 'one')) and (a <> 22))) (type: boolean)
{code}

What to do about this? I can think of 3 approaches:
- Do nothing about it, and just assume this side effect.
- Adapt SearchTransformer modification, in case of one value generate {{x <> 
val}} , but in case of multiple values do not generate the problematic 
expression {{x <> val1 AND x <> val2 ... AND x <> valN}} and instead generate 
the equivalent {{NOT (x IN (val1, val2, ...valN))}} ; this NOT-IN approach does 
suffer from the "Iceberg predicate pushdown" problem, and the tests show that 
we keep the "Column stats: PARTIAL" (see 
[here|https://github.com/apache/hive/pull/6503/changes/d31ba37f6c96ce9cf4078db66716ec2dfe75ee44])
- Keep the AND - NOT_EQUALS expression in SearchTransformer, and at some point 
(couldn't find really where so far), during the Hive to Iceberg conversion, 
intercept the predicate, identify any potential AND of NOT_EQUALS expression, 
and convert it into a NOT IN equivalent. This seems a big ad-hoc, but I guess 
could do the trick.
- Other?

> Improve SEARCH expansion to exploit <> operator 
> ------------------------------------------------
>
>                 Key: HIVE-28911
>                 URL: https://issues.apache.org/jira/browse/HIVE-28911
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>
> During various CBO transformations (especially during simplifications) the 
> internal SEARCH (CALCITE-4173) operator is introduced in the plan. The SEARCH 
> operator cannot be executed directly and must be expanded (using 
> SearchTransformer) to an equivalent form for further processing.
> The SEARCH operator can be used to represent many types of range predicates 
> including the inequality operator (<>).
> +Example+
> {code:sql}
> explain cbo
> select d_date_sk
> from date_dim
> where d_dom <> 10 and d_dom <> 20;
> {code}
> The intermediate plan before SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
>   HiveFilter(condition=[SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)])])
>     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The two inequalities were converted to a Sarg with three ranges.
> The final plan after SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
>   HiveFilter(condition=[OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20)))])
>     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The conversion to ranges/Sarg is useful cause it allows us the optimizer to 
> perform much more powerful simplifications especially for complex predicates. 
> However, the expanded expression for this simple range is sub-optimal.
> Ideally, the final filter condition after expansion should be the following:
> {noformat}
> AND(<>($9, 10), <>($9, 20))
> {noformat}
> The goal of this ticket is to be able to exploit the inequality operator when 
> expanding ranges to generate simpler and slightly more efficient expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-28911) Improve SEARCH expansion to exploit <> operator

Reply via email to