[ 
https://issues.apache.org/jira/browse/HIVE-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084387#comment-18084387
 ] 

Ruben Q L commented on HIVE-28911:
----------------------------------

I think I just found something while debugging: looks like this is not a 
CNF-related problem, but rather a mismatch on partition key homogenization.

When using the conjunction of {{{}NOT_EQUALS{}}}, it seems the column stats 
cannot be retrieved from the cache in {{ParseContext#getColStatsCached}} 
because the method is called with this PrunedPartitionList key:
{noformat}
default.tbl_ice;(((b <> 'four') and (b <> 'one') and (a <> 22)) or ((b) IN 
('four', 'one') or (a = 22)) is null)
{noformat}
And in the context's {{ctx.getOpContext().getColStatsCache()}} cache there's a 
bucket for it, but the key is slightly different (notice the extra parenthesis)
{noformat}
default.tbl_ice;((((b <> 'four') and (b <> 'one')) and (a <> 22)) or ((b) IN 
('four', 'one') or (a = 22)) is null)
                  ^                             ^  
{noformat}
so this is a cache miss, stats are not found in this case, so they need to be 
estimated, and this leads to a cascade effect that results in the ColStats NONE 
that we see in the final output!
Switching to a NOT_IN approach seems to avoid the problem, same with the 
original expression {{(b < 'four') or ((b > 'four') and (b < 'one')) or (b > 
'one')}}

I'll try to identify where this key mismatch originates...

> Improve SEARCH expansion to exploit <> operator 
> ------------------------------------------------
>
>                 Key: HIVE-28911
>                 URL: https://issues.apache.org/jira/browse/HIVE-28911
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>
> During various CBO transformations (especially during simplifications) the 
> internal SEARCH (CALCITE-4173) operator is introduced in the plan. The SEARCH 
> operator cannot be executed directly and must be expanded (using 
> SearchTransformer) to an equivalent form for further processing.
> The SEARCH operator can be used to represent many types of range predicates 
> including the inequality operator (<>).
> +Example+
> {code:sql}
> explain cbo
> select d_date_sk
> from date_dim
> where d_dom <> 10 and d_dom <> 20;
> {code}
> The intermediate plan before SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
>   HiveFilter(condition=[SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)])])
>     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The two inequalities were converted to a Sarg with three ranges.
> The final plan after SEARCH expansion is shown below.
> {noformat}
> HiveProject(d_date_sk=[$0])
>   HiveFilter(condition=[OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20)))])
>     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
> {noformat}
> The conversion to ranges/Sarg is useful cause it allows us the optimizer to 
> perform much more powerful simplifications especially for complex predicates. 
> However, the expanded expression for this simple range is sub-optimal.
> Ideally, the final filter condition after expansion should be the following:
> {noformat}
> AND(<>($9, 10), <>($9, 20))
> {noformat}
> The goal of this ticket is to be able to exploit the inequality operator when 
> expanding ranges to generate simpler and slightly more efficient expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to