[
https://issues.apache.org/jira/browse/HIVE-26671?focusedWorklogId=820955&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-820955
]
ASF GitHub Bot logged work on HIVE-26671:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 27/Oct/22 12:01
Start Date: 27/Oct/22 12:01
Worklog Time Spent: 10m
Work Description: kasakrisz commented on code in PR #3706:
URL: https://github.com/apache/hive/pull/3706#discussion_r1006475709
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/topnkey/TopNKeyPushdownProcessor.java:
##########
@@ -347,14 +347,12 @@ private void pushdownThroughTopNKey(TopNKeyOperator
topNKey) throws SemanticExce
if (topNKeyDesc.getKeyColumns().size() == commonKeyPrefix.size()) {
// TNK keys are subset of the parent TNK keys
pushdownThroughParent(topNKey);
- if (topNKey.getChildOperators().get(0).getType() ==
OperatorType.TOPNKEY) {
- LOG.debug("Removing {} since child {} supersedes it",
parent.getName(), topNKey.getName());
-
topNKey.getParentOperators().get(0).removeChildAndAdoptItsChildren(topNKey);
- }
+ LOG.debug("Removing {} since child {} supersedes it",
parent.getName(), topNKey.getName());
+
parent.getParentOperators().get(0).removeChildAndAdoptItsChildren(parent);
Review Comment:
This does not seems to be a valid change because the previous statement is
`pushdownThroughParent(topNKey);`
Let assume we have the following operator subtree:
```
SEL
TNK[P] (key = a, b)
TNK[C] (key = a)
```
Two outcome exists depending on how far the
`pushdownThroughParent(topNKey);` can push down `TNK[C]`:
* `TNK[C]` pushed through `SEL`
```
TNK[C] (key = a)
SEL
TNK[P] (key = a, b)
```
We stop here since `SEL` should process rows filtered by `TNK[P] (key = a,
b)`
* `TNK[C]` pushed through only `TNK[P]`
```
SEL
TNK[C] (key = a)
TNK[P] (key = a, b)
```
In this case `TNK[C]` can be removed since `TNK[P]` filters out more rows
and there are no operators between the TNKs which would process the rows
filtered out by `TNK[P]`.
This is what the condition `(topNKey.getChildOperators().get(0).getType() ==
OperatorType.TOPNKEY)` checks and if it is true we do the remove:
```
SEL
TNK[P] (key = a, b)
```
##########
ql/src/test/results/clientpositive/llap/bucket_groupby.q.out:
##########
@@ -1753,9 +1753,9 @@ STAGE PLANS:
filterExpr: (ds = '103') (type: boolean)
Statistics: Num rows: 500 Data size: 89000 Basic stats:
COMPLETE Column stats: COMPLETE
Top N Key Operator
- sort order: ++
- keys: key (type: string), value (type: string)
- null sort order: zz
+ sort order: +
+ keys: key (type: string)
+ null sort order: z
Review Comment:
I think removing one of the key columns from TNK is too aggressive in this
case because both GBYs in the mapper and reducer has the same keys.
Issue Time Tracking
-------------------
Worklog Id: (was: 820955)
Time Spent: 0.5h (was: 20m)
> Incorrect results for group by/order by/limit query with 2 aggregates
> ---------------------------------------------------------------------
>
> Key: HIVE-26671
> URL: https://issues.apache.org/jira/browse/HIVE-26671
> Project: Hive
> Issue Type: Bug
> Components: Operators
> Reporter: Steve Carlin
> Assignee: Steve Carlin
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Grabbed this query from the Impala test suite. It is a query run off of
> tpcds tables, but it's not really super special. You will need a lot of data
> to reproduce this, though.
> select
> l_orderkey,
> min(l_shipdate) as flt,
> count(distinct l_partkey) as cnl
> from lineitem
> group by l_orderkey order by l_orderkey limit 2;
> The issue is with the Top N Key operator optimizer. The Top N Key operator is
> the first operator after the Table Scan. The sort key is on both the
> l_orderkey and l_partkey columns, but this means that the second sort key
> might not be forwarded.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)