[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414006#comment-17414006 ]
Nemon Lou commented on HIVE-24579:
----------------------------------
After debugging, I find the cause of the bug is fairly straightforward:
No ordering is guaranteed in the final result, but the TopN in the mapper
filters out part of the data, which produces incorrect aggregation results for
some keys. For example:
Assume that key1 is among the top 10 keys at first and is later squeezed out by
other keys, but some of its data has already been transmitted downstream. As a
result, key1 gets an incorrect (partial) aggregation result in the reduce phase.
However, the final result is not taken from the top 10 keys but from the
combined outputs of multiple reducers. Therefore, key1 may be returned, causing
an error in the final result.
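The scenario above can be reproduced with a small simulation. This is only a
simplified sketch, not Hive's actual TopN implementation: the function names,
the row-at-a-time filter, the modulo partitioning, and the tiny LIMIT of 2 are
all assumptions made for illustration.

```python
from collections import Counter

LIMIT = 2          # stands in for "limit 10" in the query
NUM_REDUCERS = 2   # hypothetical number of reducers


def mapper_with_topn(rows, limit):
    """Forward (key, 1) only while the key is among the `limit` smallest
    keys seen so far; later rows of a squeezed-out key are dropped."""
    seen = set()
    forwarded = []
    for key in rows:
        seen.add(key)
        top = sorted(seen)[:limit]
        if key in top:
            forwarded.append((key, 1))
    return forwarded


def reducer(partials, limit):
    """Merge partial counts per key, then apply the reducer-side limit."""
    merged = Counter()
    for key, count in partials:
        merged[key] += count
    return sorted(merged.items())[:limit]


def run_query(rows, limit=LIMIT, num_reducers=NUM_REDUCERS):
    forwarded = mapper_with_topn(rows, limit)
    outputs = []
    for r in range(num_reducers):
        # Assumed partitioning: key modulo the number of reducers.
        part = [(k, c) for k, c in forwarded if k % num_reducers == r]
        outputs.extend(reducer(part, limit))
    # The fetch takes the first `limit` rows of the concatenated reducer
    # outputs; there is no global ordering across reducers.
    return outputs[:limit]


rows = [9, 9, 8, 8, 1, 2, 9, 9, 8]
result = run_query(rows)
true_counts = Counter(rows)
wrong = [(k, c) for k, c in result if true_counts[k] != c]
print("result:", result)
print("rows with a wrong count:", wrong)
```

In this run key 8 is forwarded twice before keys 1 and 2 squeeze it out, so its
third row is dropped. The reducer still emits key 8 with a count of 2 instead
of 3, and the fetch picks that row up.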
> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.3.7, 3.1.2, 4.0.0
> Reporter: Nemon Lou
> Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes an incorrect result.
> {code:sql}
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: test
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> GatherStats: false
> Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> Group By Operator
> aggregations: count()
> keys: id (type: int)
> mode: hash
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
> file:/user/hive/warehouse/test [test]
> Path -> Partition:
> file:/user/hive/warehouse/test
> Partition
> base file name: test
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
> COLUMN_STATS_ACCURATE
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
> COLUMN_STATS_ACCURATE
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments
> columns.types int
> file.inputformat
> org.apache.hadoop.mapred.TextInputFormat
> file.outputformat
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> name: default.test
> name: default.test
> Truncated Path -> Alias:
> /test [test]
> Reducer 2
> Execution mode: vectorized
> Needs Tagging: false
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> Limit
> Number of rows: 10
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> GlobalTableId: 0
> directory:
> file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
> NumFilesPerFileSink: 1
> Statistics: Num rows: 1 Data size: 13500 Basic stats:
> COMPLETE Column stats: NONE
> Stats Publishing Key Prefix:
> file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> properties:
> columns _col0,_col1
> columns.types int:bigint
> escape.delim \
> hive.serialization.extend.additional.nesting.levels
> true
> serialization.escape.crlf true
> serialization.format 1
> serialization.lib
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> TotalFiles: 1
> GatherStats: false
> MultiFileSpray: false
> Stage: Stage-0
> Fetch Operator
> limit: 10
> Processor Tree:
> ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>