[jira] [Commented] (HIVE-11410) Join with subquery containing a group by incorrectly returns no results

Matt McCline (JIRA) Fri, 31 Jul 2015 00:14:32 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648853#comment-14648853
 ]


Matt McCline commented on HIVE-11410:
-------------------------------------


Right, postgres produces:
{code}
mmccline=# SELECT
  t1.c1
FROM 
  t t1
JOIN
(SELECT 
   t2.c1,
   MAX(t2.c2) AS c2
 FROM 
   t t2 
 GROUP BY 
   t2.c1
) t3
ON t1.c2=t3.c2;
 c1 
-----
 abc
(1 row)
{code}

And, Hive branch-1.1 produces the right result:
{code}
SELECT 
  t1.c1
FROM 
  t t1
JOIN
(SELECT 
   t2.c1,
   MAX(t2.c2) AS c2
 FROM 
   t t2 
 GROUP BY 
   t2.c1
) t3
ON t1.c2=t3.c2;
abc
{code}

Here is the EXPLAIN plan:
{code}
EXPLAIN
SELECT 
  t1.c1
FROM 
  t t1
JOIN
(SELECT 
   t2.c1,
   MAX(t2.c2) AS c2
 FROM 
   t t2 
 GROUP BY 
   t2.c1
) t3
ON t1.c2=t3.c2;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t2
            Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: c1 (type: string), c2 (type: int)
              outputColumnNames: c1, c2
              Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
              Group By Operator
                aggregations: max(c2)
                keys: c1 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
                  value expressions: _col1 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: max(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: 
NONE
          Select Operator
            expressions: _col1 (type: int)
            outputColumnNames: _col1
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
            Filter Operator
              predicate: _col1 is not null (type: boolean)
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
              Select Operator
                expressions: _col1 (type: int)
                outputColumnNames: _col1
                Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col1 (type: int)
              sort order: +
              Map-reduce partition columns: _col1 (type: int)
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
          TableScan
            alias: t1
            Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
            Filter Operator
              predicate: c2 is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
              Reduce Output Operator
                key expressions: c2 (type: int)
                sort order: +
                Map-reduce partition columns: c2 (type: int)
                Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE 
Column stats: NONE
                value expressions: c1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 c2 (type: int)
            1 _col1 (type: int)
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column 
stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

So, perhaps different environment variable setting(s) are causing a plan in 
branch-1.1 to be produced that the no results case for this JIRA.


> Join with subquery containing a group by incorrectly returns no results
> -----------------------------------------------------------------------
>
>                 Key: HIVE-11410
>                 URL: https://issues.apache.org/jira/browse/HIVE-11410
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Brenwald
>            Assignee: Matt McCline
>            Priority: Minor
>         Attachments: hive-site.xml
>
>
> Start by creating a table *t* with columns *c1* and *c2* and populate with 1 
> row of data. For example create table *t* from an existing table which 
> contains at least 1 row of data by running:
> {code}
> create table t as select 'abc' as c1, 0 as c2 from Y limit 1; 
> {code}
> Table *t* looks like the following:
> ||c1||c2||
> |abc|0|
> Running the following query then returns zero results.
> {code}
> SELECT 
>   t1.c1
> FROM 
>   t t1
> JOIN
> (SELECT 
>    t2.c1,
>    MAX(t2.c2) AS c2
>  FROM 
>    t t2 
>  GROUP BY 
>    t2.c1
> ) t3
> ON t1.c2=t3.c2
> {code}
> However, we expected to see the following:
> ||c1||
> |abc|
> The problem seems to relate to the fact that in the subquery, we group by 
> column *c1*, but this is not subsequently used in the join condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-11410) Join with subquery containing a group by incorrectly returns no results

Reply via email to