[
https://issues.apache.org/jira/browse/HIVE-26671?focusedWorklogId=820962&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-820962
]
ASF GitHub Bot logged work on HIVE-26671:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 27/Oct/22 12:16
Start Date: 27/Oct/22 12:16
Worklog Time Spent: 10m
Work Description: kasakrisz commented on PR #3706:
URL: https://github.com/apache/hive/pull/3706#issuecomment-1293439464
Thanks @scarlin-cloudera for investigating this issue. This patch is a
possible solution.
I would like to share another approach: IIUC the issue is caused by the
extra key column added because of the distinct in the RS located in the mapper.
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L5753
Without TNK the plan of the query mentioned in the jira looks like this:
```
Map
  TS
  SEL
  GBY (l_orderkey, l_partkey)
  RS (l_orderkey, l_partkey)
Reduce
  GBY (KEY._col0)
  RS (col0)
  ...
```
A TNK is created on top of each RS, with its keys coming from the
corresponding RS. Both TNKs are then pushed down until they reach the TS, and
when the TNKs are merged the one with 2 keys is kept.
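To illustrate why keeping the TNK with 2 keys is a problem, here is a toy model of a Top N Key filter. It is only a sketch and not Hive's actual operator: the class `TopNKeySketch`, its methods and the sample rows are all made up for illustration. It assumes the filter forwards a row only if its key is among the N smallest distinct keys seen so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only; names and data are hypothetical, not taken from Hive.
class TopNKeySketch {

    // Lexicographic ascending order over the key columns.
    static final Comparator<List<Integer>> KEY_ORDER = (a, b) -> {
        for (int i = 0; i < a.size(); i++) {
            int c = Integer.compare(a.get(i), b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    // Rows are {l_orderkey, l_partkey}; made-up sample data.
    static List<int[]> sampleRows() {
        return Arrays.asList(
            new int[] {1, 10}, new int[] {1, 20}, new int[] {1, 30},
            new int[] {2, 10}, new int[] {2, 20},
            new int[] {3, 10});
    }

    // Forwards a row only if its key (the first keyWidth columns) is among
    // the n smallest distinct keys seen so far.
    static List<int[]> filter(List<int[]> rows, int keyWidth, int n) {
        TreeSet<List<Integer>> top = new TreeSet<>(KEY_ORDER);
        List<int[]> forwarded = new ArrayList<>();
        for (int[] row : rows) {
            List<Integer> key = new ArrayList<>();
            for (int i = 0; i < keyWidth; i++) {
                key.add(row[i]);
            }
            top.add(key);
            if (top.size() > n) {
                top.pollLast(); // evict the largest key
            }
            if (top.contains(key)) {
                forwarded.add(row);
            }
        }
        return forwarded;
    }

    // count(distinct l_partkey) grouped by l_orderkey.
    static Map<Integer, Integer> distinctPartkeys(List<int[]> rows) {
        Map<Integer, Set<Integer>> seen = new TreeMap<>();
        for (int[] r : rows) {
            seen.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        seen.forEach((k, v) -> counts.put(k, v.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<int[]> rows = sampleRows();
        // Key (l_orderkey, l_partkey), limit 2: rows of the wanted groups are dropped.
        System.out.println(distinctPartkeys(filter(rows, 2, 2))); // {1=2}
        // Key (l_orderkey) only, limit 2: both groups arrive complete.
        System.out.println(distinctPartkeys(filter(rows, 1, 2))); // {1=3, 2=2}
    }
}
```

With the 2-key filter only the rows (1, 10) and (1, 20) survive, so the distinct count for l_orderkey 1 is too low and l_orderkey 2 is missing entirely.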
How about skipping TNK creation in `TopNKeyProcessor` if the RS has keys
defined because of distinct
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java#L424-L426
and keeping the existing behavior when no distinct aggregates are present?
I would expect that only TNK (l_orderkey) remains.
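Roughly, the rule I have in mind could look like the sketch below. This is a standalone model, not the real Hive classes: `TnkCreationSketch`, `ReduceSink`, `hasDistinctKeys` and `topNKeysFor` are hypothetical names, and the actual check would go through whatever `ReduceSinkDesc` exposes at the linked lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Standalone sketch; none of these names are Hive APIs.
class TnkCreationSketch {

    // Hypothetical stand-in for the information carried by a ReduceSink descriptor.
    static final class ReduceSink {
        final List<String> keyColumns;
        final boolean hasDistinctKeys; // true if some keys exist only because of distinct

        ReduceSink(List<String> keyColumns, boolean hasDistinctKeys) {
            this.keyColumns = keyColumns;
            this.hasDistinctKeys = hasDistinctKeys;
        }
    }

    // Returns the keys for a new TNK, or empty if no TNK should be created.
    static Optional<List<String>> topNKeysFor(ReduceSink rs) {
        if (rs.hasDistinctKeys) {
            return Optional.empty(); // proposed change: skip TNK creation
        }
        return Optional.of(rs.keyColumns); // existing behavior
    }

    public static void main(String[] args) {
        ReduceSink mapperRs =
            new ReduceSink(Arrays.asList("l_orderkey", "l_partkey"), true);
        ReduceSink reducerRs = new ReduceSink(Arrays.asList("col0"), false);
        System.out.println(topNKeysFor(mapperRs));  // Optional.empty
        System.out.println(topNKeysFor(reducerRs)); // Optional[[col0]]
    }
}
```

In this model only the reducer-side RS yields a TNK, which matches the single-key TNK expected above.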
What do you think?
Issue Time Tracking
-------------------
Worklog Id: (was: 820962)
Time Spent: 40m (was: 0.5h)
> Incorrect results for group by/order by/limit query with 2 aggregates
> ---------------------------------------------------------------------
>
> Key: HIVE-26671
> URL: https://issues.apache.org/jira/browse/HIVE-26671
> Project: Hive
> Issue Type: Bug
> Components: Operators
> Reporter: Steve Carlin
> Assignee: Steve Carlin
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Grabbed this query from the Impala test suite. It is a query run off of
> tpcds tables, but it's not really super special. You will need a lot of data
> to reproduce this, though.
> select
> l_orderkey,
> min(l_shipdate) as flt,
> count(distinct l_partkey) as cnl
> from lineitem
> group by l_orderkey order by l_orderkey limit 2;
> The issue is with the Top N Key operator optimizer. The Top N Key operator is
> the first operator after the Table Scan. The sort key is on both the
> l_orderkey and l_partkey columns, but this means that the second sort key
> might not be forwarded.