[jira] [Commented] (HIVE-14731) Use Tez cartesian product edge in Hive (unpartitioned case only)

Jesus Camacho Rodriguez (JIRA) Tue, 06 Dec 2016 10:45:26 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726343#comment-15726343
 ]


Jesus Camacho Rodriguez commented on HIVE-14731:
------------------------------------------------

[~aplusplus], I have been trying your patch on top of HIVE-15251, which 
introduces the possibility to have complex conditions in the ON clause of outer 
joins.

If I understand correctly the behavior of the new edge, we will have a problem 
with outer joins plus this patch, as when we set the flag 
_hive.tez.cartesian-product.enabled_ to _true_, XPROD\_EDGE is introduced even 
for outer joins. Assume we have the following query without keys:

{code:sql}
EXPLAIN
SELECT *
FROM test1 LEFT OUTER JOIN test2
ON (test1.value=test2.value
  OR test1.key between 100 and 102
  OR test2.key between 100 and 102);
{code}

The plan produced is the following:
{code}
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: 
jcamachorodriguez_20161206094821_b3595ae2-642d-4610-aa18-c349fb60f37d:12
      Edges:
        Reducer 2 <- Map 1 (XPROD_EDGE), Map 3 (XPROD_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: test1
                  Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: key (type: int), value (type: int), col_1 
(type: string)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 6 Data size: 56 Basic stats: COMPLETE 
Column stats: NONE
                    Reduce Output Operator
                      sort order:
                      Statistics: Num rows: 6 Data size: 56 Basic stats: 
COMPLETE Column stats: NONE
                      value expressions: _col0 (type: int), _col1 (type: int), 
_col2 (type: string)
            Execution mode: vectorized
            LLAP IO: no inputs
        Map 3
            Map Operator Tree:
                TableScan
                  alias: test2
                  Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: key (type: int), value (type: int), col_2 
(type: string)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 4 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                    Reduce Output Operator
                      sort order:
                      Statistics: Num rows: 4 Data size: 38 Basic stats: 
COMPLETE Column stats: NONE
                      value expressions: _col0 (type: int), _col1 (type: int), 
_col2 (type: string)
            Execution mode: vectorized
            LLAP IO: no inputs
        Reducer 2
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0
                  1
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                residual filter predicates: {((_col1 = _col4) or _col0 BETWEEN 
100 AND 102 or _col3 BETWEEN 100 AND 102)}
                Statistics: Num rows: 24 Data size: 476 Basic stats: COMPLETE 
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 24 Data size: 476 Basic stats: COMPLETE 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.211 seconds, Fetched: 67 row(s)
{code}

If we have multiple tasks, the query will produce wrong results: if left input 
has _n_ tasks, the join will create _n_ copies of each record in the left input 
for which the condition in the ON clause does not evaluate to true (with NULL 
values for the right fields). Am I right?

I think your code should only introduce XPROD\_EDGE for inner joins at the 
moment. Outer joins will need some further work on the Hive side, e.g., method 
call to consolidate the final results for the outer join and produce the record 
with NULL values iff no result for that record came out of any partition.

> Use Tez cartesian product edge in Hive (unpartitioned case only)
> ----------------------------------------------------------------
>
>                 Key: HIVE-14731
>                 URL: https://issues.apache.org/jira/browse/HIVE-14731
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zhiyuan Yang
>            Assignee: Zhiyuan Yang
>         Attachments: HIVE-14731.1.patch, HIVE-14731.2.patch, 
> HIVE-14731.3.patch, HIVE-14731.4.patch, HIVE-14731.5.patch, 
> HIVE-14731.6.patch, HIVE-14731.7.patch, HIVE-14731.8.patch, HIVE-14731.9.patch
>
>
> Given cartesian product edge is available in Tez now (see TEZ-3230), let's 
> integrate it into Hive on Tez. This allows us to have more than one reducer 
> in cross product queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-14731) Use Tez cartesian product edge in Hive (unpartitioned case only)

Reply via email to