[
https://issues.apache.org/jira/browse/HIVE-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068569#comment-15068569
]
Ashutosh Chauhan commented on HIVE-12664:
-----------------------------------------
Can you paste the stack trace of the error you got? I tried to reproduce this on
master but couldn't. I got the following explain plan:
{code}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-4
Stage-3 depends on stages: Stage-2
Stage-4 is a root stage
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: www_access
filterExpr: host is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: host is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: host (type: string), time (type: int)
outputColumnNames: host, time
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: min(time)
keys: host (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col1 (type: int)
Reduce Operator Tree:
Group By Operator
aggregations: min(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col1 (type: int)
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col1 (type: int)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: count(DISTINCT _col1), max(_col3)
keys: _col0 (type: string), _col1 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: int)
sort order: ++
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col3 (type: int)
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0), max(VALUE._col1)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint), _col2 (type: int), 1450811390 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: www_access
filterExpr: host is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: host is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: host (type: string), time (type: int)
outputColumnNames: host, time
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: min(time)
keys: host (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col1 (type: int)
Reduce Operator Tree:
Group By Operator
aggregations: min(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
Were there any other config settings needed to trigger the bug?
The logic of checking only one branch of the join doesn't look correct.
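One way to rule out configuration differences is to pin the reduce deduplication settings explicitly before running EXPLAIN on the sample query from the description. This is only a sketch: that hive.optimize.reducededuplication and hive.optimize.reducededuplication.min.reducer are the settings that matter here is an assumption, and the values below are illustrative rather than taken from the reporter's environment.
{code:sql}
-- Assumption: these two properties control whether reduce deduplication runs.
-- Make sure the optimization is enabled.
SET hive.optimize.reducededuplication=true;
-- Lower the reducer threshold so the optimization is less likely to be skipped
-- (illustrative value, not from the report).
SET hive.optimize.reducededuplication.min.reducer=1;

EXPLAIN
SELECT
  t1.host,
  COUNT(DISTINCT t1.`date`) AS login_count,
  MAX(t2.code) AS code,
  unix_timestamp() AS time
FROM (
  SELECT
    HOST,
    MIN(time) AS DATE
  FROM
    www_access
  WHERE
    HOST IS NOT NULL
  GROUP BY
    HOST
) t1
JOIN (
  SELECT
    HOST,
    MIN(time) AS code
  FROM
    www_access
  WHERE
    HOST IS NOT NULL
  GROUP BY
    HOST
) t2
ON t1.host = t2.host
GROUP BY
  t1.host;
{code}
If the plan or the failure changes when these values change, that would suggest the repro depends on configuration rather than only on the query shape.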
> Bug in reduce deduplication optimization causing ArrayOutOfBoundException
> -------------------------------------------------------------------------
>
> Key: HIVE-12664
> URL: https://issues.apache.org/jira/browse/HIVE-12664
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 1.1.1, 1.2.1
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Attachments: HIVE-12664-1.patch, HIVE-12664.1.patch, HIVE-12664.patch
>
>
> The optimisation check for reduce deduplication only checks the first child
> node of a join, and the check itself also contains a major bug, causing
> ArrayOutOfBoundException no matter what.
> Sample table schema:
> ||time||user||host||path||referer||code||agent||size||method||
> |int|string|string|string|string|bigint|string|bigint|string|
> Sample query:
> {code:sql}
> SELECT
> t1.host,
> COUNT(DISTINCT t1.`date`) AS login_count,
> MAX(t2.code) AS code,
> unix_timestamp() AS time
> FROM (
> SELECT
> HOST,
> MIN(time) AS DATE
> FROM
> www_access
> WHERE
> HOST IS NOT NULL
> GROUP BY
> HOST
> ) t1
> JOIN (
> SELECT
> HOST,
> MIN(time) AS code
> FROM
> www_access
> WHERE
> HOST IS NOT NULL
> GROUP BY
> HOST
> ) t2
> ON t1.host = t2.host
> GROUP BY
> t1.host
> {code}