[jira] [Commented] (HIVE-12664) Bug in reduce deduplication optimization causing ArrayOutOfBoundException

Ashutosh Chauhan (JIRA) Tue, 22 Dec 2015 11:13:40 -0800

    [ https://issues.apache.org/jira/browse/HIVE-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068569#comment-15068569 ]


Ashutosh Chauhan commented on HIVE-12664:
-----------------------------------------

Can you paste the stack trace of the error you got? I tried to repro this on 
master but couldn't. I got the following explain plan:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-4
  Stage-3 depends on stages: Stage-2
  Stage-4 is a root stage
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: www_access
            filterExpr: host is not null (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: host is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: host (type: string), time (type: int)
                outputColumnNames: host, time
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Group By Operator
                  aggregations: min(time)
                  keys: host (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: string)
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    value expressions: _col1 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: min(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string)
              sort order: +
              Map-reduce partition columns: _col0 (type: string)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              value expressions: _col1 (type: int)
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string)
              sort order: +
              Map-reduce partition columns: _col0 (type: string)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              value expressions: _col1 (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1, _col3
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Group By Operator
            aggregations: count(DISTINCT _col1), max(_col3)
            keys: _col0 (type: string), _col1 (type: int)
            mode: hash
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), _col1 (type: int)
              sort order: ++
              Map-reduce partition columns: _col0 (type: string)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              value expressions: _col3 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), max(VALUE._col1)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint), _col2 (type: int), 1450811390 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: www_access
            filterExpr: host is not null (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: host is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: host (type: string), time (type: int)
                outputColumnNames: host, time
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Group By Operator
                  aggregations: min(time)
                  keys: host (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: string)
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    value expressions: _col1 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: min(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


{code}

Were any other config settings needed to trigger the bug?
The logic of checking only one branch of the join doesn't look correct.
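
To help compare environments, here is a rough sketch of the settings worth reporting alongside the stack trace. It assumes the reduce deduplication properties are the relevant ones; the value 1 below is only an illustrative choice and does not come from the report:
{code:sql}
-- Print the effective values of the reduce deduplication settings
set hive.optimize.reducededuplication;
set hive.optimize.reducededuplication.min.reducer;

-- Example only (assumption): lower the reducer threshold so the optimization
-- is not skipped on small test data, then re-run EXPLAIN on the sample query
set hive.optimize.reducededuplication.min.reducer=1;
{code}
As far as I recall, hive.optimize.reducededuplication.min.reducer defaults to 4 and the optimization is skipped when the reducer count is below it, so a small test table may not exercise the same code path.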



> Bug in reduce deduplication optimization causing ArrayOutOfBoundException
> -------------------------------------------------------------------------
>
>                 Key: HIVE-12664
>                 URL: https://issues.apache.org/jira/browse/HIVE-12664
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.1.1, 1.2.1
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>         Attachments: HIVE-12664-1.patch, HIVE-12664.1.patch, HIVE-12664.patch
>
>
> The optimisation check for reduce deduplication only checks the first child 
> node for join -and the check itself also contains a major bug- causing 
> ArrayOutOfBoundException no matter what.
> Sample data table form:
> ||time||user||host||path||referer||code||agent||size||method||
> |int|string|string|string|string|bigint|string|bigint|string|
> Sample query
> {code:sql}
> SELECT 
>   t1.host,
>   COUNT(DISTINCT t1.`date`) AS login_count,
>   MAX(t2.code) AS code,
>   unix_timestamp() AS time
> FROM (
>     SELECT 
>       HOST,
>       MIN(time) AS DATE
>     FROM
>       www_access
>     WHERE
>       HOST IS NOT NULL
>     GROUP BY
>       HOST
>   ) t1
> JOIN (
>     SELECT 
>       HOST,
>       MIN(time) AS code
>     FROM
>       www_access
>     WHERE
>       HOST IS NOT NULL
>     GROUP BY
>       HOST
>   ) t2
>   ON t1.host = t2.host
> GROUP BY
>   t1.host
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
