[
https://issues.apache.org/jira/browse/HIVE-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726343#comment-15726343
]
Jesus Camacho Rodriguez commented on HIVE-14731:
------------------------------------------------
[~aplusplus], I have been trying your patch on top of HIVE-15251, which
introduces the possibility to have complex conditions in the ON clause of outer
joins.
If I understand correctly the behavior of the new edge, we will have a problem
with outer joins plus this patch, as when we set the flag
_hive.tez.cartesian-product.enabled_ to _true_, XPROD\_EDGE is introduced even
for outer joins. Assume we have the following query without keys:
{code:sql}
EXPLAIN
SELECT *
FROM test1 LEFT OUTER JOIN test2
ON (test1.value=test2.value
OR test1.key between 100 and 102
OR test2.key between 100 and 102);
{code}
The plan produced is the following:
{code}
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId:
jcamachorodriguez_20161206094821_b3595ae2-642d-4610-aa18-c349fb60f37d:12
Edges:
Reducer 2 <- Map 1 (XPROD_EDGE), Map 3 (XPROD_EDGE)
DagName:
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: test1
Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: key (type: int), value (type: int), col_1
(type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 6 Data size: 56 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: int), _col1 (type: int),
_col2 (type: string)
Execution mode: vectorized
LLAP IO: no inputs
Map 3
Map Operator Tree:
TableScan
alias: test2
Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: key (type: int), value (type: int), col_2
(type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 4 Data size: 38 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: int), _col1 (type: int),
_col2 (type: string)
Execution mode: vectorized
LLAP IO: no inputs
Reducer 2
Reduce Operator Tree:
Merge Join Operator
condition map:
Left Outer Join0 to 1
keys:
0
1
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
residual filter predicates: {((_col1 = _col4) or _col0 BETWEEN
100 AND 102 or _col3 BETWEEN 100 AND 102)}
Statistics: Num rows: 24 Data size: 476 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 24 Data size: 476 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.211 seconds, Fetched: 67 row(s)
{code}
If we have multiple tasks, the query will produce wrong results: if left input
has _n_ tasks, the join will create _n_ copies of each record in the left input
for which the condition in the ON clause does not evaluate to true (with NULL
values for the right fields). Am I right?
I think your code should only introduce XPROD\_EDGE for inner joins at the
moment. Outer joins will need some further work on the Hive side, e.g., method
call to consolidate the final results for the outer join and produce the record
with NULL values iff no result for that record came out of any partition.
> Use Tez cartesian product edge in Hive (unpartitioned case only)
> ----------------------------------------------------------------
>
> Key: HIVE-14731
> URL: https://issues.apache.org/jira/browse/HIVE-14731
> Project: Hive
> Issue Type: Bug
> Reporter: Zhiyuan Yang
> Assignee: Zhiyuan Yang
> Attachments: HIVE-14731.1.patch, HIVE-14731.2.patch,
> HIVE-14731.3.patch, HIVE-14731.4.patch, HIVE-14731.5.patch,
> HIVE-14731.6.patch, HIVE-14731.7.patch, HIVE-14731.8.patch, HIVE-14731.9.patch
>
>
> Given cartesian product edge is available in Tez now (see TEZ-3230), let's
> integrate it into Hive on Tez. This allows us to have more than one reducer
> in cross product queries.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)