Siddharth Seth created HIVE-15122:
-------------------------------------
Summary: Hive: Upcasting types should not obscure stats (min/max/ndv)
Key: HIVE-15122
URL: https://issues.apache.org/jira/browse/HIVE-15122
Project: Hive
Issue Type: Bug
Reporter: Siddharth Seth
A UDFToLong cast on a join key breaks PK/FK inference and triggers mis-estimation of joins in LLAP.
Snippet from the bad plan:
{code}
STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20161031222730_a700058f-78eb-40d6-a67d-43add60a50e2:6
      Edges:
        Map 2 <- Map 1 (BROADCAST_EDGE)
        Map 3 <- Map 2 (BROADCAST_EDGE)
        Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE), Map 7 (CUSTOM_SIMPLE_EDGE), Map 8 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE)
        Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
        Reducer 6 <- Reducer 5 (SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: supplier
                  filterExpr: (s_suppkey is not null and s_nationkey is not null) (type: boolean)
                  Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (s_suppkey is not null and s_nationkey is not null) (type: boolean)
                    Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: s_suppkey (type: bigint), s_nationkey (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 2
            Map Operator Tree:
                TableScan
                  alias: lineitem
                  filterExpr: (l_suppkey is not null and l_orderkey is not null) (type: boolean)
                  Statistics: Num rows: 2285121364 Data size: 63983407882 Basic stats: COMPLETE Column stats: PARTIAL
                  Filter Operator
                    predicate: (l_suppkey is not null and l_orderkey is not null) (type: boolean)
                    Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL
                    Select Operator
                      expressions: l_orderkey (type: bigint), l_suppkey (type: int), l_extendedprice (type: double), l_discount (type: double), l_shipdate (type: date)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4
                      Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 UDFToLong(_col1) (type: bigint)
                        outputColumnNames: _col1, _col2, _col4, _col5, _col6
                        input vertices:
                          0 Map 1
                        Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL
                        Reduce Output Operator
                          key expressions: _col2 (type: bigint)
                          sort order: +
                          Map-reduce partition columns: _col2 (type: bigint)
                          Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL
                          value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 3
            Map Operator Tree:
                TableScan
                  alias: orders
                  filterExpr: (o_orderkey is not null and o_custkey is not null) (type: boolean)
                  Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (o_orderkey is not null and o_custkey is not null) (type: boolean)
                    Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: o_orderkey (type: int), o_custkey (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col2 (type: bigint)
                          1 UDFToLong(_col0) (type: bigint)
                        outputColumnNames: _col1, _col4, _col5, _col6, _col8
                        input vertices:
                          0 Map 2
                        Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col8 (type: bigint)
                          sort order: +
                          Map-reduce partition columns: _col8 (type: bigint)
                          Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 7
{code}
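Note the output of Map 2, which is broadcast to Map 3: the join is estimated at 10000000 rows (880000000 bytes), while its lineitem input is 2285121364 rows.
This causes a rather large join (120GB) to be categorized as a map-join.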
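To illustrate the behaviour the summary asks for, here is a minimal sketch of carrying min/max/ndv through a widening integer cast such as UDFToLong on an int column. It is not Hive code: `ColStats`, `INT_WIDTH` and `stats_after_cast` are hypothetical names, and the sample values for `l_suppkey` are illustrative only. The idea is that an int to bigint upcast is injective and order-preserving, so the statistics of the input column still hold for the cast output.

```python
from dataclasses import dataclass, replace
from typing import Optional


# Hypothetical, simplified column statistics; not Hive's actual classes.
@dataclass(frozen=True)
class ColStats:
    type_name: str
    min_value: int
    max_value: int
    ndv: int        # number of distinct values
    num_nulls: int


# Integer widths in bits, used to recognise a lossless widening cast.
INT_WIDTH = {"tinyint": 8, "smallint": 16, "int": 32, "bigint": 64}


def stats_after_cast(src: ColStats, target_type: str) -> Optional[ColStats]:
    """Return stats for CAST(col AS target_type), or None if they are unknown."""
    src_bits = INT_WIDTH.get(src.type_name)
    dst_bits = INT_WIDTH.get(target_type)
    if src_bits is None or dst_bits is None or dst_bits < src_bits:
        # Narrowing or non-integer cast: stats are not preserved in general.
        return None
    # Widening keeps every value distinct and in the same order,
    # so min, max, ndv and the null count carry over unchanged.
    return replace(src, type_name=target_type)


# Illustrative values only; they are not taken from the plan above.
l_suppkey = ColStats("int", 1, 10_000_000, 10_000_000, 0)
print(stats_after_cast(l_suppkey, "bigint"))
```

With stats preserved like this, an estimator would still see the ndv and value range of the foreign-key side after the cast, which is what a PK/FK check needs.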
