[
https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=491820&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-491820
]
ASF GitHub Bot logged work on HIVE-18284:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 28/Sep/20 06:43
Start Date: 28/Sep/20 06:43
Worklog Time Spent: 10m
Work Description: shameersss1 commented on a change in pull request #1400:
URL: https://github.com/apache/hive/pull/1400#discussion_r495719674
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##########
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, ReduceSinkOperator cRS, ReduceSin
       TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(),
           pRS.getConf().getOrder(), pRS.getConf().getNullOrder());
       pRS.getConf().setKeySerializeInfo(keyTable);
+    } else if (cRS.getConf().getKeyCols() != null && cRS.getConf().getKeyCols().size() > 0) {
+      ArrayList<String> keyColNames = Lists.newArrayList();
+      for (ExprNodeDesc keyCol : pRS.getConf().getKeyCols()) {
+        String keyColName = keyCol.getExprString();
+        keyColNames.add(keyColName);
+      }
+      List<FieldSchema> fields = PlanUtils.getFieldSchemasFromColumnList(pRS.getConf().getKeyCols(),
+          keyColNames, 0, "");
+      TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(fields, pRS.getConf().getOrder(),
+          pRS.getConf().getNullOrder());
+      ArrayList<String> outputKeyCols = Lists.newArrayList();
+      for (int i = 0; i < fields.size(); i++) {
+        outputKeyCols.add(fields.get(i).getName());
+      }
+      pRS.getConf().setOutputKeyColumnNames(outputKeyCols);
+      pRS.getConf().setKeySerializeInfo(keyTable);
+
       pRS.getConf().setNumDistributionKeys(cRS.getConf().getNumDistributionKeys());
     }
Review comment:
Just to add more context here: the number of distribution keys of cRS is chosen only when numDistKeys of pRS is 0 or less. In all other cases, distribution of the keys is based on pRS, which is more generic than cRS. We will enter this "if" condition only in two cases:
1. pRS keyCols is empty and cRS keyCols is empty
2. pRS keyCols is empty and cRS keyCols is not empty
So in case (1) we would like to keep the pRS properties intact, since pRS is more generic. In case (2) we want to go with the cRS properties, hence I think returning false is not required.
Does this make sense? Or am I missing anything?
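The case analysis above can be sketched as a standalone toy (hypothetical class and method names; lists of column names stand in for `pRS.getConf().getKeyCols()` and `cRS.getConf().getKeyCols()`, and the real `merge` mutates a `ReduceSinkDesc` rather than returning a value):

```java
import java.util.Collections;
import java.util.List;

// Simplified sketch of the key-merge decision discussed in the review comment.
public class MergeSketch {
    /** Returns the key columns the merged (parent) ReduceSink should carry. */
    static List<String> mergedKeyCols(List<String> pKeyCols, List<String> cKeyCols) {
        if (pKeyCols != null && !pKeyCols.isEmpty()) {
            // pRS already has keys: keep the more generic parent properties.
            return pKeyCols;
        }
        if (cKeyCols != null && !cKeyCols.isEmpty()) {
            // Case (2): pRS keys empty, cRS keys present -> adopt cRS properties.
            return cKeyCols;
        }
        // Case (1): both empty -> keep the (empty) pRS properties intact.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // pRS has no keys, cRS distributes by datekey -> cRS keys win.
        System.out.println(mergedKeyCols(Collections.emptyList(), List.of("datekey")));
    }
}
```

In neither case does the decision require `merge` to return false; it only changes which side's key properties survive.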
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 491820)
Time Spent: 1h 40m (was: 1.5h)
> NPE when inserting data with 'distribute by' clause with dynpart sort optimization
> ---------------------------------------------------------------------------------
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 2.3.1, 2.3.2
> Reporter: Aki Tanaka
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with a 'distribute by' clause. The following snippet query reproduces this issue:
> *(non-vectorized, non-llap mode)*
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I removed the Distribute By clause or used Cluster By instead.
> It seems the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys. FileSinkOperator removes the previous fsp, which might be re-used when we use Distribute By:
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_000000_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
> 	... 14 more
> Caused by: java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
> 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
> 	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
> 	... 17 more
> {code}
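The failure mode described in the report can be modeled with a toy sketch (hypothetical names; the real logic lives in FileSinkOperator#process). The dynpart sort optimization assumes rows arrive grouped by partition key, so the writer tears down the previous partition's state as soon as the key changes; with DISTRIBUTE BY alone the keys are not sorted within a reducer, a partition key can reappear after its state was torn down, and the operator then dereferences the removed fsp:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a dynamic-partition writer that closes the previous
// partition's state on every key change (as the sort optimization assumes).
public class DynPartWriterSketch {
    static List<String> process(List<Integer> partKeys) {
        List<String> events = new ArrayList<>();
        Integer current = null;
        for (Integer key : partKeys) {
            if (!key.equals(current)) {
                if (current != null) {
                    events.add("close " + current); // previous fsp removed here
                }
                events.add("open " + key);
                current = key;
            }
            events.add("write " + key);
        }
        if (current != null) {
            events.add("close " + current);
        }
        return events;
    }

    public static void main(String[] args) {
        // Unsorted keys (as produced by DISTRIBUTE BY): partition 1 must be
        // re-opened after it was already closed, which the real operator
        // does not expect.
        System.out.println(process(List.of(1, 2, 1)));
    }
}
```

With sorted input (1, 1, 2) each partition is opened exactly once, which is why Cluster By, or removing Distribute By, avoids the NPE.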
--
This message was sent by Atlassian Jira
(v8.3.4#803005)