[jira] [Updated] (HIVE-7232) VectorReduceSink is emitting incorrect JOIN keys

Gopal V (JIRA) Wed, 18 Jun 2014 09:51:12 -0700

     [ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Gopal V updated HIVE-7232:
--------------------------

    Description: 
After HIVE-7121, TPC-H query 5 returns incorrect results.

Thanks to [~navis], it has been tracked down to the auto-parallel settings 
which were initialized for ReduceSinkOperator, but not for 
VectorReduceSinkOperator. The vector version inherits, but doesn't call 
super.initializeOp() or set up the variable correctly from ReduceSinkDesc.
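
The failure mode described above can be sketched in isolation. This is a minimal illustration, not the Hive source: `SinkDesc`, `BaseSink`, `BrokenVectorSink`, and `FixedVectorSink` are hypothetical stand-ins for ReduceSinkDesc, ReduceSinkOperator, and VectorReduceSinkOperator, and the single `autoParallel` flag is an assumed simplification of the auto-parallel settings.

```java
// Hypothetical stand-in for ReduceSinkDesc: carries the planner's setting.
class SinkDesc {
    final boolean autoParallel;

    SinkDesc(boolean autoParallel) {
        this.autoParallel = autoParallel;
    }
}

// Hypothetical stand-in for ReduceSinkOperator.
class BaseSink {
    protected final SinkDesc conf;
    // Stays at the Java default (false) until initializeOp() copies it over.
    protected boolean autoParallel;

    BaseSink(SinkDesc conf) {
        this.conf = conf;
    }

    void initializeOp() {
        autoParallel = conf.autoParallel;
    }

    boolean usesAutoParallel() {
        return autoParallel;
    }
}

// Overrides initializeOp() without calling super, so the inherited
// field silently keeps its default value.
class BrokenVectorSink extends BaseSink {
    BrokenVectorSink(SinkDesc conf) {
        super(conf);
    }

    @Override
    void initializeOp() {
        // vector-specific setup would go here; autoParallel is never set
    }
}

// One possible repair: delegate to the parent before doing anything else.
// Copying the value from conf directly would have the same effect.
class FixedVectorSink extends BaseSink {
    FixedVectorSink(SinkDesc conf) {
        super(conf);
    }

    @Override
    void initializeOp() {
        super.initializeOp();
        // vector-specific setup would go here
    }
}

class InitDemo {
    public static void main(String[] args) {
        SinkDesc desc = new SinkDesc(true);

        BaseSink base = new BaseSink(desc);
        BaseSink broken = new BrokenVectorSink(desc);
        BaseSink fixed = new FixedVectorSink(desc);
        base.initializeOp();
        broken.initializeOp();
        fixed.initializeOp();

        System.out.println("base:   " + base.usesAutoParallel());   // true
        System.out.println("broken: " + broken.usesAutoParallel()); // false
        System.out.println("fixed:  " + fixed.usesAutoParallel());  // true
    }
}
```

In this sketch the row-mode and vectorized sinks are built from the same descriptor yet disagree on the setting, which is the kind of mismatch the analysis above points to.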

The query is tpc-h query5, with extra NULL checks just to be sure.

{code}
SELECT n_name,
       sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer,
     orders,
     lineitem,
     supplier,
     nation,
     region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= '1994-01-01'
  AND o_orderdate < '1995-01-01'
  and l_orderkey is not null
  and c_custkey is not null
  and l_suppkey is not null
  and c_nationkey is not null
  and s_nationkey is not null
  and n_regionkey is not null
GROUP BY n_name
ORDER BY revenue DESC;
{code}

The reducer which has the issue has the following plan

{code}
Reducer 3
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {KEY.reducesinkkey0} {VALUE._col2}
                  1 {VALUE._col0} {KEY.reducesinkkey0} {VALUE._col3}
                outputColumnNames: _col0, _col3, _col10, _col11, _col14
                Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col10 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col10 (type: int)
                  Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: int), _col3 (type: int), _col11 (type: int), _col14 (type: string)
{code}

  was:
After HIVE-4867 has been merged in, some queries have exhibited a very weird 
skew towards NULL keys emitted from the ReduceSinkOperator.

Added extra logging to print expr.column() in ExprNodeColumnEvaluator & in 
reduce sink.

{code}
2014-06-14 00:37:19,186 INFO [TezChild] org.apache.hadoop.hive.ql.exec.ReduceSinkOperator:
numDistributionKeys = 1 {null --> ExprNodeColumnEvaluator(_col10)}
key_row={"reducesinkkey0":442}
{code}

{code}
      HiveKey firstKey = toHiveKey(cachedKeys[0], tag, null);
      int distKeyLength = firstKey.getDistKeyLength();
      if(distKeyLength <= 1) {
        StringBuffer x1 = new StringBuffer();
        x1.append("numDistributionKeys = "+ numDistributionKeys + "\n");
        for (int i = 0; i < numDistributionKeys; i++) {
            x1.append(cachedKeys[0][i] + " --> " + keyEval[i] + "\n");
        }
        x1.append("key_row="+ SerDeUtils.getJSONString(row, keyObjectInspector));
        LOG.info("GOPAL: " + x1.toString());
      }
{code}

The query is tpc-h query5, with extra NULL checks just to be sure.

{code}
SELECT n_name,
       sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer,
     orders,
     lineitem,
     supplier,
     nation,
     region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= '1994-01-01'
  AND o_orderdate < '1995-01-01'
  and l_orderkey is not null
  and c_custkey is not null
  and l_suppkey is not null
  and c_nationkey is not null
  and s_nationkey is not null
  and n_regionkey is not null
GROUP BY n_name
ORDER BY revenue DESC;
{code}

The reducer which has the issue has the following plan

{code}
Reducer 3
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {KEY.reducesinkkey0} {VALUE._col2}
                  1 {VALUE._col0} {KEY.reducesinkkey0} {VALUE._col3}
                outputColumnNames: _col0, _col3, _col10, _col11, _col14
                Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col10 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col10 (type: int)
                  Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: int), _col3 (type: int), _col11 (type: int), _col14 (type: string)
{code}

        Summary: VectorReduceSink is emitting incorrect JOIN keys  (was: 
ReduceSink is emitting NULL keys due to failed keyEval)

updated bug report with analysis

> VectorReduceSink is emitting incorrect JOIN keys
> ------------------------------------------------
>
>                 Key: HIVE-7232
>                 URL: https://issues.apache.org/jira/browse/HIVE-7232
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-7232-extra-logging.patch, HIVE-7232.1.patch.txt, 
> q5.explain.txt, q5.sql
>
>
> After HIVE-7121, TPC-H query 5 returns incorrect results.
> Thanks to [~navis], it has been tracked down to the auto-parallel settings 
> which were initialized for ReduceSinkOperator, but not for 
> VectorReduceSinkOperator. The vector version inherits, but doesn't call 
> super.initializeOp() or set up the variable correctly from ReduceSinkDesc.
> The query is tpc-h query5, with extra NULL checks just to be sure.
> {code}
> SELECT n_name,
>        sum(l_extendedprice * (1 - l_discount)) AS revenue
> FROM customer,
>      orders,
>      lineitem,
>      supplier,
>      nation,
>      region
> WHERE c_custkey = o_custkey
>   AND l_orderkey = o_orderkey
>   AND l_suppkey = s_suppkey
>   AND c_nationkey = s_nationkey
>   AND s_nationkey = n_nationkey
>   AND n_regionkey = r_regionkey
>   AND r_name = 'ASIA'
>   AND o_orderdate >= '1994-01-01'
>   AND o_orderdate < '1995-01-01'
>   and l_orderkey is not null
>   and c_custkey is not null
>   and l_suppkey is not null
>   and c_nationkey is not null
>   and s_nationkey is not null
>   and n_regionkey is not null
> GROUP BY n_name
> ORDER BY revenue DESC;
> {code}
> The reducer which has the issue has the following plan
> {code}
> Reducer 3
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 condition expressions:
>                   0 {KEY.reducesinkkey0} {VALUE._col2}
>                   1 {VALUE._col0} {KEY.reducesinkkey0} {VALUE._col3}
>                 outputColumnNames: _col0, _col3, _col10, _col11, _col14
>                 Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col10 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col10 (type: int)
>                   Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
>                   value expressions: _col0 (type: int), _col3 (type: int), _col11 (type: int), _col14 (type: string)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
