[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418562#comment-17418562 ]
Nemon Lou commented on HIVE-24579:
----------------------------------
[~kkasa] Good job!
After reading your PR, I have a few concerns.
1. Does the added sorting stage cause compatibility problems? For example,
the rows returned after the sort differ from those returned before
(there are many examples in the .q.out files). That said, this seems to be
less of a problem than returning an incorrect result.
2. Which is faster: TopN + ORDER BY, or one stage fewer (no TopN and no
ORDER BY)? Do different scenarios call for different solutions?
3. Do we also need to fix this for the scenario where cbo=false?
Thanks.
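
To make concerns 1 and 2 concrete, here is a toy Python model (not Hive code)
of one way a map-side TopN could interact with partial aggregation. The
assumptions are illustrative and not taken from the PR: each mapper
pre-aggregates its own rows, forwards only its top_n smallest keys, and the
reducer merges whatever arrives. The names map_side and reduce_side are
hypothetical.

```python
from collections import Counter


def map_side(rows, top_n=None):
    """Pre-aggregate one mapper's rows; optionally keep only its top_n smallest keys."""
    partial = Counter(rows)
    keys = sorted(partial)
    if top_n is not None:
        keys = keys[:top_n]  # assumed TopN behaviour: drop everything past the local top N
    return {k: partial[k] for k in keys}


def reduce_side(partials):
    """Merge the partial counts forwarded by all mappers."""
    merged = Counter()
    for p in partials:
        merged.update(p)  # Counter.update adds counts per key
    return dict(merged)


# Two mappers reading different splits of the same table (made-up ids).
mapper_inputs = [[1, 2, 3, 3], [3, 4, 5]]

exact = reduce_side(map_side(r) for r in mapper_inputs)
pruned = reduce_side(map_side(r, top_n=2) for r in mapper_inputs)

# Groups that reached the reducer with an incomplete count.
wrong = {k for k in pruned if pruned[k] != exact[k]}

# The globally smallest top_n keys are always forwarded by every mapper that
# saw them, so their counts stay complete.
first_two_ok = all(pruned[k] == exact[k] for k in sorted(pruned)[:2])

print("exact :", exact)
print("pruned:", pruned)
print("wrong :", wrong)
```

In this model only the globally smallest keys are guaranteed complete counts.
A limit that takes rows in key order returns correct groups, while a limit
that takes arbitrary rows may return a group such as id 3 with a partial
count.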
> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Affects Versions: 2.3.7, 3.1.2, 4.0.0
> Reporter: Nemon Lou
> Assignee: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> An unexpected TopN appears in the map phase, which causes an incorrect result.
> {code:sql}
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: test
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> GatherStats: false
> Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Group By Operator
> aggregations: count()
> keys: id (type: int)
> mode: hash
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
> file:/user/hive/warehouse/test [test]
> Path -> Partition:
> file:/user/hive/warehouse/test
> Partition
> base file name: test
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
> COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
> COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> name: default.test
> name: default.test
> Truncated Path -> Alias:
> /test [test]
> Reducer 2
> Execution mode: vectorized
> Needs Tagging: false
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Limit
> Number of rows: 10
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> GlobalTableId: 0
> directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
> NumFilesPerFileSink: 1
> Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
> Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
> table:
> input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> properties:
> columns _col0,_col1
> columns.types int:bigint
> escape.delim \
> hive.serialization.extend.additional.nesting.levels true
> serialization.escape.crlf true
> serialization.format 1
> serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> TotalFiles: 1
> GatherStats: false
> MultiFileSpray: false
> Stage: Stage-0
> Fetch Operator
> limit: 10
> Processor Tree:
> ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>