[jira] [Work logged] (HIVE-26312) Use default digest normalization strategy in CBO

ASF GitHub Bot (Jira) Fri, 26 Aug 2022 11:51:06 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26312?focusedWorklogId=804056&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-804056
 ]


ASF GitHub Bot logged work on HIVE-26312:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Aug/22 18:50
            Start Date: 26/Aug/22 18:50
    Worklog Time Spent: 10m 
      Work Description: soumyakanti3578 commented on code in PR #3359:
URL: https://github.com/apache/hive/pull/3359#discussion_r956341300


##########
ql/src/test/results/clientpositive/llap/vector_interval_2.q.out:
##########
@@ -1350,16 +1350,16 @@ STAGE PLANS:
             Map Operator Tree:
                 TableScan
                   alias: vector_interval_2
-                  filterExpr: (CAST( str1 AS INTERVAL YEAR TO MONTH) is not 
null and ((dt + CAST( str1 AS INTERVAL YEAR TO MONTH)) = DATE'2002-03-01') and 
((dt + INTERVAL'1-2') = DATE'2002-03-01') and (dt <> (dt + CAST( str1 AS 
INTERVAL YEAR TO MONTH))) and (dt <> (dt + INTERVAL'1-2'))) (type: boolean)
+                  filterExpr: (((dt + CAST( str1 AS INTERVAL YEAR TO MONTH)) = 
DATE'2002-03-01') and ((dt + INTERVAL'1-2') = DATE'2002-03-01') and (dt <> (dt 
+ CAST( str1 AS INTERVAL YEAR TO MONTH))) and (dt <> (dt + INTERVAL'1-2'))) 
(type: boolean)

Review Comment:
   I think I figured out why `is not null` is missing here. I tried to 
reproduce this with the following test:
   
   > explain vectorization expression
   select ts from vector_interval_2
   where
     timestamp '2001-01-01 01:02:03' <= dt + interval '0 1:2:3' day to second
     and dt + interval '0 1:2:3' day to second = timestamp '2001-01-01 01:02:03'
     and dt + interval '0 1:2:3' day to second >= timestamp '2001-01-01 
01:02:03'
   order by ts;
   
   With default value for `CalciteSystemProperty.ENABLE_REX_DIGEST_NORMALIZE`, 
we can see 
[here](https://github.com/apache/calcite/blob/calcite-1.25.0/core/src/main/java/org/apache/calcite/rex/RexNormalize.java#L123)
 that `<, <=` are preferred over `>, >=`, i.e., if possible all `>, >=` are 
reversed. Now, the `AND` clause contains expressions like `(A<B) and (C<=D) and 
(B>A) and (D>=C)`, for example, just above this diff, there's the diff on line 
1172 which essentially has:
   >str3 <= str4 and str3 < str4 and str4 >= str3 and str4 > str3
   
   and we can see that the second half is pruned as it is exactly the same as 
the first half. These duplicate expressions are not getting added in the 
`digests` 
[here](https://github.com/apache/calcite/blob/calcite-1.25.0/core/src/main/java/org/apache/calcite/rex/RexUtil.java#L1142).
 `digests.add` eventually leads to  `RexNormalize.hashcode` method, which 
reverses the `>, >=` rexnodes and returns same hashcode as the `<, <=` 
expressions and thus they are not added to `digests`. 
   
   Now, when `CalciteSystemProperty.ENABLE_REX_DIGEST_NORMALIZE` is set to 
`false`, we get all these duplicate expressions which are not pruned, and in 
`RexSimplify.simplifyUsingPredicates` method, we calculate `range2` 
[here](https://github.com/apache/calcite/blob/calcite-1.25.0/core/src/main/java/org/apache/calcite/rex/RexSimplify.java#L1631),
 by calling `residue` and passing the list of predicates 
`predicates.pulledUpPredicates`. This list contains all the predicates that 
have been seen before the current predicate. So when there are duplicates and 
we see the same predicate for the second time, `residue` returns `Range.all()` 
from 
[here](https://github.com/apache/calcite/blob/calcite-1.25.0/core/src/main/java/org/apache/calcite/rex/RexSimplify.java#L1695),
 and this results in a `is not null` 
[here](https://github.com/apache/calcite/blob/calcite-1.25.0/core/src/main/java/org/apache/calcite/rex/RexSimplify.java#L1643).
   
   I think all the changes in this commit are fine as all of them are because 
of pruning of duplicate predicates. What do you think @zabetak ?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 804056)
    Time Spent: 40m  (was: 0.5h)

> Use default digest normalization strategy in CBO
> ------------------------------------------------
>
>                 Key: HIVE-26312
>                 URL: https://issues.apache.org/jira/browse/HIVE-26312
>             Project: Hive
>          Issue Type: Task
>          Components: CBO
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> CALCITE-2450 introduced a way to improve planning time by normalizing some 
> query expressions (RexNodes). The behavior can be enabled/disabled via the 
> following system property: calcite.enable.rexnode.digest.normalize
> There was an attempt to disable the normalization explicitly in HIVE-23456 to 
> avoid rendering HiveFilterSortPredicates rule useless. However, the [way the 
> normalization is disabled 
> now|https://github.com/apache/hive/blob/f29cb2245c97102975ea0dd73783049eaa0947a0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L549],
>  dependents on the way classes are loaded. If for some reason 
> CalciteSystemProperty is loaded before hitting the respective line in 
> Hive.java setting the property will not have any effect.
> After HIVE-26238 the behavior of the rule is not dependent in the value of 
> the property so there is nothing holding us back from enabling the 
> normalization.
> At the moment there is not strong reason to enable or disable the 
> normalization explicitly so it is better to rely on the default value 
> provided by Calcite to avoid running with different normalization strategy 
> when the class loading order changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Work logged] (HIVE-26312) Use default digest normalization strategy in CBO

Reply via email to