[ https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141446#comment-16141446 ]
liyunzhang_intel commented on HIVE-16823:
-----------------------------------------
Some updates:
{quote}
This is why the query runs if map join is disabled, in which case GBY is
followed by SEL/RS instead of SparkHashTableSinkOperator.
{quote}
More explanation on this:
{code}
set spark.master=local;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=false;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.vectorized.execution.enabled=true;
set hive.strict.checks.cartesian.product=false;
set hive.auto.convert.join=false;
set hive.cbo.enable=false;
set hive.optimize.constant.propagation=true;
select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart
group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
{code}
With CBO disabled, the explain plan is not right: the key of the GroupBy in the reducer is {{keys: '2008-04-08' (type: string)}} when it should be {{keys: KEY._col0 (type: string)}}, yet the query finishes successfully. The reason is that {{RS\[9\]}} follows {{GBY\[4\]}}:
{code}
GBY[4]-SEL[5]-RS[9]
{code}
Vectorizing {{RS\[9\]}} triggers the following call stack, in which {{OutputColumnManager#allocateOutputColumn}} causes {{OutputColumnManager#getScratchColumnTypeNames}} to return a non-empty value:
{code}
org.apache.hadoop.hive.ql.exec.vector.VectorizationContext$OutputColumnManager.allocateOutputColumn(VectorizationContext.java:478)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getConstantVectorExpression(VectorizationContext.java:1153)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpression(VectorizationContext.java:688)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpressions(VectorizationContext.java:590)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizationContext.getVectorExpressions(VectorizationContext.java:578)
    at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.canSpecializeReduceSink(Vectorizer.java:3490)
    at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.vectorizeOperator(Vectorizer.java:4174)
    at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationNodeProcessor.doVectorize(Vectorizer.java:1632)
    at org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$ReduceWorkVectorizationNodeProcessor.process(Vectorizer.java:1772)
{code}
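For intuition, here is a minimal sketch of the bookkeeping behind this stack (an assumed simplification, not the actual Hive implementation): materializing the constant key {{'2008-04-08'}} as a {{ConstantVectorExpression}} needs an output column to write into, and every allocation appends a type name to the scratch list, which is why {{getScratchColumnTypeNames}} becomes non-empty:
{code}
// Simplified sketch of the scratch-column bookkeeping in
// VectorizationContext.OutputColumnManager; the method names follow the
// real class, but the logic here is illustrative only.
import java.util.ArrayList;
import java.util.List;

class OutputColumnManagerSketch {
  private final int firstOutputColumnIndex;  // first index past the "real" columns
  private final List<String> scratchTypes = new ArrayList<>();

  OutputColumnManagerSketch(int firstOutputColumnIndex) {
    this.firstOutputColumnIndex = firstOutputColumnIndex;
  }

  // Called, e.g., when a ConstantVectorExpression needs a column to write
  // its value into; every call grows the scratch column list by one.
  int allocateOutputColumn(String hiveTypeName) {
    scratchTypes.add(hiveTypeName);
    return firstOutputColumnIndex + scratchTypes.size() - 1;
  }

  // After vectorizing RS[9] this returns ["string"], so any batch built
  // from the context gets one extra (scratch) column.
  String[] getScratchColumnTypeNames() {
    return scratchTypes.toArray(new String[0]);
  }
}
{code}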
In the log, we can see that after {{VectorizationNodeProcessor#doVectorize}} processes {{GBY\[4\]}}, the vectorization context is:
{code}
2017-08-25T03:40:21,316 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] physical.Vectorizer: Vectorized ReduceWork reduce shuffle vectorization context Context name __Reduce_Shuffle__, level 0, sorted projectionColumnMap {0=KEY._col0}, scratchColumnTypeNames []
{code}
After {{VectorizationNodeProcessor#doVectorize}} processes {{RS\[9\]}}, the vectorization context is (here {{scratchColumnTypeNames}} is non-empty):
{code}
2017-08-25T03:48:00,245 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] physical.Vectorizer: vectorizeOperator org.apache.hadoop.hive.ql.plan.ReduceSinkDesc
2017-08-25T03:48:43,101 DEBUG [cd30697a-7797-4bbe-ad92-1fcec8a89689 main] physical.Vectorizer: Vectorized ReduceWork operator RS added vectorization context Context name SEL, level 1, sorted projectionColumnMap {}, scratchColumnTypeNames [string]
{code}
The difference in {{scratchColumnTypeNames}} leads to a differently sized {{outputBatch}} in [VectorGroupKeyHelper|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupKeyHelper.java#L107].
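To see how that size difference turns into the exception, here is a minimal, self-contained sketch (the {{Batch}} and {{ColVector}} types are hypothetical stand-ins, not Hive classes): {{copyGroupKey}} indexes {{outputBatch.cols}} with the key's column index, so a batch built from a context whose {{scratchColumnTypeNames}} is empty is one column short:
{code}
// Hypothetical stand-ins for VectorizedRowBatch/ColumnVector, just to show
// why the group-key copy overruns when the scratch column is missing.
class ColVector { }

class Batch {
  final ColVector[] cols;
  Batch(int numCols) {
    cols = new ColVector[numCols];
    for (int i = 0; i < numCols; i++) {
      cols[i] = new ColVector();
    }
  }
}

public class CopyGroupKeySketch {
  public static void main(String[] args) {
    int keyColumnIndex = 1; // index the group-key copy expects to write to

    // Context with scratchColumnTypeNames [string] (RS[9] was vectorized):
    // the batch has an extra scratch column, so index 1 is valid.
    Batch withScratch = new Batch(2);
    ColVector ok = withScratch.cols[keyColumnIndex];

    // Context with scratchColumnTypeNames [] (no RS after the GBY):
    // the batch has only column 0, and the same access throws
    // java.lang.ArrayIndexOutOfBoundsException: 1, as in the trace below.
    Batch withoutScratch = new Batch(1);
    ColVector boom = withoutScratch.cols[keyColumnIndex];
  }
}
{code}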
I guess the {{ArrayIndexOutOfBoundsException}} can be reproduced under the following conditions, in either Spark or Tez mode:
1. CBO is disabled
2. no RS follows the GBY in the reducer
> "ArrayIndexOutOfBoundsException" in
> spark_vectorized_dynamic_partition_pruning.q
> --------------------------------------------------------------------------------
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
> Issue Type: Bug
> Reporter: Jianguo Tian
> Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch,
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] spark.SparkReduceRecordHandler: Fatal error: org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) ~[scala-library-2.11.8.jar:?]
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893) ~[scala-library-2.11.8.jar:?]
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) ~[scala-library-2.11.8.jar:?]
>     at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.scheduler.Task.run(Task.scala:85) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
>     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1035) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:400) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     ... 17 more
> 2017-06-05T09:20:31,472 ERROR [Executor task launch worker-0] executor.Executor: Exception in task 2.0 in stage 1.0 (TID 8)
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:315) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) ~[scala-library-2.11.8.jar:?]
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893) ~[scala-library-2.11.8.jar:?]
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) ~[scala-library-2.11.8.jar:?]
>     at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.scheduler.Task.run(Task.scala:85) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) ~[spark-core_2.11-2.0.0.jar:2.0.0]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
>     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     ... 16 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1035) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:400) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     ... 16 more
> 2017-06-05T09:20:31,488 DEBUG [dispatcher-event-loop-2] scheduler.TaskSchedulerImpl: parentName: , name: TaskSet_1, runningTasks: 0
> 2017-06-05T09:20:31,493 WARN [task-result-getter-0] scheduler.TaskSetManager: Lost task 2.0 in stage 1.0 (TID 8, localhost): java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:315)
>     at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>     at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>     at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>     at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>     at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>     at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>     at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
>     at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at org.apache.spark.scheduler.Task.run(Task.scala:85)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>     ... 16 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1035)
>     at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:400)
>     ... 17 more
> 2017-06-05T09:20:31,495 ERROR [task-result-getter-0] scheduler.TaskSetManager: Task 2 in stage 1.0 failed 1 times; aborting job
> {code}
> This exception happens at this line of VectorGroupKeyHelper.java:
> {code}
> BytesColumnVector outputColumnVector = (BytesColumnVector) outputBatch.cols[columnIndex];
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)