[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nemon Lou updated HIVE-24579:
-----------------------------
Description:
{code:sql}
create table test(id int);
explain extended select id,count(*) from test group by id limit 10;
{code}
There is an unexpected TopN in the map phase, which causes incorrect results.
{code:sql}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: root_20210104113831_2451d621-8f77-4a29-9da6-3a65bc4d9e56:2
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: root_20210104113831_2451d621-8f77-4a29-9da6-3a65bc4d9e56:2
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: test
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
expressions: id (type: int)
outputColumnNames: id
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: id (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
null sort order: a
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
tag: -1
value expressions: _col1 (type: bigint)
auto parallelism: true
Execution mode: vectorized
Path -> Alias:
file:/user/hive/warehouse/test [test]
Path -> Partition:
file:/user/hive/warehouse/test
Partition
base file name: test
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
bucket_count -1
bucketing_version 2
column.name.delimiter ,
columns id
columns.comments
columns.types int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location file:/user/hive/warehouse/test
name default.test
numFiles 0
numRows 0
rawDataSize 0
serialization.ddl struct test { i32 id}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 0
transient_lastDdlTime 1609730190
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
bucket_count -1
bucketing_version 2
column.name.delimiter ,
columns id
columns.comments
columns.types int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location file:/user/hive/warehouse/test
name default.test
numFiles 0
numRows 0
rawDataSize 0
serialization.ddl struct test { i32 id}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 0
transient_lastDdlTime 1609730190
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.test
name: default.test
Truncated Path -> Alias:
/test [test]
Reducer 2
Execution mode: vectorized
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 10
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_11-38-31_601_4363062670409846390-1/-mr-10001/.hive-staging_hive_2021-01-04_11-38-31_601_4363062670409846390-1/-ext-10002
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_11-38-31_601_4363062670409846390-1/-mr-10001/.hive-staging_hive_2021-01-04_11-38-31_601_4363062670409846390-1/-ext-10002/
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
columns _col0,_col1
columns.types int:bigint
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: 10
Processor Tree:
ListSink
Time taken: 0.111 seconds, Fetched: 141 row(s)
{code}
was:
{code:sql}
create table test(id int);
explain extended select id,count(*) from test group by id limit 10;
{code}
There is an unexpected TopN in the map phase, which causes incorrect results.
{code:sql}
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test
Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
expressions: id (type: int)
outputColumnNames: id
Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: id (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
null sort order: a
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column stats: NONE
tag: -1
TopN: 10
TopN Hash Memory Usage: 0.1
value expressions: _col1 (type: bigint)
auto parallelism: false
Path -> Alias:
file:/user/hive/warehouse/test [test]
Path -> Partition:
file:/user/hive/warehouse/test
Partition
base file name: test
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
bucket_count -1
column.name.delimiter ,
columns id
columns.comments
columns.types int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location file:/user/hive/warehouse/test
name default.test
numFiles 0
numRows 0
rawDataSize 0
serialization.ddl struct test { i32 id}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 0
transient_lastDdlTime 1609730036
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
bucket_count -1
column.name.delimiter ,
columns id
columns.comments
columns.types int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location file:/user/hive/warehouse/test
name default.test
numFiles 0
numRows 0
rawDataSize 0
serialization.ddl struct test { i32 id}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 0
transient_lastDdlTime 1609730036
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.test
name: default.test
Truncated Path -> Alias:
/test [test]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 168 Data size: 672 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 10
Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory: file:/tmp/root/bd08973b-b58c-4185-9072-c1891f67878d/hive_2021-01-04_11-14-01_745_4475755683092435506-1/-mr-10001/.hive-staging_hive_2021-01-04_11-14-01_745_4475755683092435506-1/-ext-10002
NumFilesPerFileSink: 1
Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: file:/tmp/root/bd08973b-b58c-4185-9072-c1891f67878d/hive_2021-01-04_11-14-01_745_4475755683092435506-1/-mr-10001/.hive-staging_hive_2021-01-04_11-14-01_745_4475755683092435506-1/-ext-10002/
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
columns _col0,_col1
columns.types int:bigint
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: 10
Processor Tree:
ListSink
Time taken: 1.877 seconds, Fetched: 128 row(s)
{code}
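The plan above shows TopN: 10 on the map-side Reduce Output Operator, ahead of the final mergepartial Group By. The following Python sketch is a toy model of how such a map-side TopN could yield a wrong count. It is not Hive code, and the issue itself does not spell out the mechanism. The function names (map_side, reduce_side, run), the key % num_reducers partitioning, the two-reducer setup, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def map_side(rows, top_n=None):
    """Hash-aggregate one mapper's keys; optionally keep only the top_n smallest keys."""
    partial = Counter(rows)
    keys = sorted(partial)
    if top_n is not None:
        keys = keys[:top_n]  # hypothetical map-side TopN on the group-by key
    return {k: partial[k] for k in keys}


def reduce_side(mapper_outputs, num_reducers, limit):
    """Merge partial counts per reducer, then take the first `limit` rows overall."""
    reducers = [Counter() for _ in range(num_reducers)]
    for out in mapper_outputs:
        for key, cnt in out.items():
            # Assumed partitioning scheme: key modulo the number of reducers.
            reducers[key % num_reducers][key] += cnt
    rows = []
    for r in reducers:
        rows.extend(sorted(r.items())[:limit])  # per-reducer Limit
    return rows[:limit]  # final fetch limit


def run(mappers, top_n, num_reducers=2, limit=2):
    return reduce_side([map_side(m, top_n) for m in mappers], num_reducers, limit)


# Made-up input: each list holds the `id` values seen by one mapper.
mapper_a = [1, 1, 3, 3, 4]
mapper_b = [2, 4, 4, 4]

with_topn = run([mapper_a, mapper_b], top_n=2)
without_topn = run([mapper_a, mapper_b], top_n=None)

print(with_topn)     # [(2, 1), (4, 3)]  key 4 is undercounted
print(without_topn)  # [(2, 1), (4, 4)]  key 4 has its full count
```

In this model, mapper A drops its partial count for key 4 because keys 1 and 3 sort ahead of it, while mapper B keeps key 4. The reducer that owns key 4 then merges only mapper B's partial count, and the final limit happens to pick that row.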
> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.3.7, 3.1.2, 4.0.0
> Reporter: Nemon Lou
> Priority: Critical
>