[jira] [Work logged] (HIVE-25856) Intermittent null ordering in plans of queries with GROUP BY and LIMIT

ASF GitHub Bot (Jira) Tue, 11 Jan 2022 02:56:23 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-25856?focusedWorklogId=706773&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-706773
 ]


ASF GitHub Bot logged work on HIVE-25856:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Jan/22 10:55
            Start Date: 11/Jan/22 10:55
    Worklog Time Spent: 10m 
      Work Description: kasakrisz commented on a change in pull request #2932:
URL: https://github.com/apache/hive/pull/2932#discussion_r782032263



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateSortLimitRule.java
##########
@@ -55,29 +54,13 @@
  */
 public class HiveAggregateSortLimitRule extends RelOptRule {
 
-  private static HiveAggregateSortLimitRule instance = null;
-
-  public static final HiveAggregateSortLimitRule getInstance(HiveConf hiveConf) {
-    if (instance == null) {
-      RelFieldCollation.NullDirection defaultAscNullDirection;
-      if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVE_DEFAULT_NULLS_LAST)) {
-        defaultAscNullDirection = RelFieldCollation.NullDirection.LAST;
-      } else {
-        defaultAscNullDirection = RelFieldCollation.NullDirection.FIRST;
-      }
-      instance = new HiveAggregateSortLimitRule(defaultAscNullDirection);
-    }
-
-    return instance;
-  }
-
   private final RelFieldCollation.NullDirection defaultAscNullDirection;
 
-
-  private HiveAggregateSortLimitRule(RelFieldCollation.NullDirection defaultAscNullDirection) {
+  public HiveAggregateSortLimitRule(boolean nullsLast) {

Review comment:
       I generally avoid hardcoding because the null ordering behavior affects 
the `Top N Key` operator pushdown optimization.
   
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/topnkey/TopNKeyPushdownProcessor.java
   
   The `Top N Key` (TNK) operator is introduced into the physical plan as a 
parent operator of the `Reduce Sink`. It takes the sort keys and ordering 
parameters from the `Reduce Sink`. The pushdown optimization tries to move the 
TNK down as far as the TS (Table Scan) if possible.
   More complex queries may have several `Reduce Sinks` or even other TNKs that 
should be merged. This is the point where null ordering also matters.
   
   I think it is safer to use the config.
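
   To illustrate why null ordering matters for a top-n style filter, here is a 
minimal, self-contained sketch. It does not use Hive or Calcite classes; the 
class `NullOrderingTopN` and its methods are hypothetical names, and the 
`nullsLast` flag only stands in for a value that would be read from the 
config. It shows that the same keys yield a different top-n set depending on 
where nulls sort.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; not Hive's Top N Key implementation.
public class NullOrderingTopN {

  // Builds an ascending comparator whose null placement comes from a flag
  // (standing in for a config value) instead of being hardcoded.
  public static Comparator<Integer> ascending(boolean nullsLast) {
    return nullsLast
        ? Comparator.nullsLast(Comparator.<Integer>naturalOrder())
        : Comparator.nullsFirst(Comparator.<Integer>naturalOrder());
  }

  // Keeps the first n keys under the chosen ordering.
  public static List<Integer> topN(List<Integer> keys, int n, boolean nullsLast) {
    return keys.stream()
        .sorted(ascending(nullsLast))
        .limit(n)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Integer> keys = Arrays.asList(3, null, 1, 2);
    System.out.println(topN(keys, 2, false)); // prints [null, 1]
    System.out.println(topN(keys, 2, true));  // prints [1, 2]
  }
}
```

   If two operators that should be merged disagree on null placement, they 
would keep different rows, which is one reason to derive the setting from a 
single source such as the config.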
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



Issue Time Tracking
-------------------

    Worklog Id:     (was: 706773)
    Time Spent: 0.5h  (was: 20m)

> Intermittent null ordering in plans of queries with GROUP BY and LIMIT
> ----------------------------------------------------------------------
>
>                 Key: HIVE-25856
>                 URL: https://issues.apache.org/jira/browse/HIVE-25856
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:sql}
> CREATE TABLE person (id INTEGER, country STRING);
> EXPLAIN CBO SELECT country, count(1) FROM person GROUP BY country LIMIT 5;
> {code}
> The {{EXPLAIN}} query produces a slightly different plan (ordering of nulls) 
> from one execution to another.
> {noformat}
> CBO PLAN:
> HiveSortLimit(sort0=[$1], dir0=[ASC-nulls-first], fetch=[5])
>   HiveProject(country=[$0], $f1=[$1])
>     HiveAggregate(group=[{1}], agg#0=[count()])
>       HiveTableScan(table=[[default, person]], table:alias=[person])
> {noformat}
> {noformat}
> CBO PLAN:
> HiveSortLimit(sort0=[$1], dir0=[ASC], fetch=[5])
>   HiveProject(country=[$0], $f1=[$1])
>     HiveAggregate(group=[{1}], agg#0=[count()])
>       HiveTableScan(table=[[default, person]], table:alias=[person])
> {noformat}
> This is unlikely to cause wrong results because most aggregate functions (not 
> all) do not return nulls, so null ordering doesn't matter much, but it can 
> lead to other problems such as:
> * intermittent CI failures
> * query/plan caching
> I bumped into this problem after investigating test failures in CI. The 
> following query in 
> [offset_limit_ppd_optimizer.q|https://github.com/apache/hive/blob/9cfdac44975bf38193de7449fc21b9536109daea/ql/src/test/queries/clientpositive/offset_limit_ppd_optimizer.q]
>  returns a different plan when it runs individually than when it runs along with 
> some other qtest files.
> {code:sql}
> explain
> select * from
> (select key, count(1) from src group by key order by key limit 10,20) subq
> join
> (select key, count(1) from src group by key limit 20,20) subq2
> on subq.key=subq2.key limit 3,5;
> {code}
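
The diff in the review above removes a lazily initialized static `getInstance`. The following self-contained sketch shows how that kind of caching could make a plan depend on which query ran first in the same JVM. That this is the mechanism behind the intermittent plans is an assumption, not something stated in this message. The class `CachedRuleSketch` and everything in it are hypothetical names; no Hive or Calcite code is reproduced.

```java
// Hypothetical illustration only; names do not correspond to Hive classes.
public class CachedRuleSketch {

  public enum NullDirection { FIRST, LAST }

  public static final class Rule {
    public final NullDirection defaultAscNullDirection;

    public Rule(boolean nullsLast) {
      this.defaultAscNullDirection = nullsLast ? NullDirection.LAST : NullDirection.FIRST;
    }
  }

  private static Rule instance = null;

  // Same shape as a lazily initialized singleton: the flag is only
  // consulted on the very first call, later values are silently ignored.
  public static Rule getInstance(boolean nullsLast) {
    if (instance == null) {
      instance = new Rule(nullsLast);
    }
    return instance;
  }

  public static void main(String[] args) {
    Rule first = getInstance(false);  // caches FIRST
    Rule second = getInstance(true);  // flag ignored, still the cached instance
    Rule fresh = new Rule(true);      // per-use construction honors the flag

    System.out.println(first.defaultAscNullDirection);  // prints FIRST
    System.out.println(second.defaultAscNullDirection); // prints FIRST
    System.out.println(fresh.defaultAscNullDirection);  // prints LAST
  }
}
```

Constructing the rule per use, as the public constructor in the diff allows, avoids carrying state from one query to the next.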



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
