[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363438#comment-15363438 ]
Vladimir Ivanov edited comment on SPARK-16334 at 7/13/16 1:51 AM:
------------------------------------------------------------------
Hi, we hit a problem with the same stack trace in Spark 2.0. In our case
it is thrown during a DataFrame.rdd.aggregate call. It also seems to depend on
the volume of data, because it is not thrown when we adjust the filter criteria
to change the data volume. We used Spark SQL to write these Parquet files and
did not explicitly set the WriterVersion option, so I believe the default
version was used.
was (Author: vivanov):
Hi, we hit a problem with the same stack trace in Spark 2.0. In our case
it is thrown during a DataFrame.rdd call. It also seems to depend on the volume
of data, because it is not thrown when we adjust the filter criteria to change
the data volume. We used Spark SQL to write these Parquet files and did not
explicitly set the WriterVersion option, so I believe the default version was
used.
> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -------------------------------------------------------------------------
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Egor Pahomov
> Priority: Critical
> Labels: sql
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Works on 1.6.1.
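
For readers unfamiliar with the top frames of the stack trace: decodeToBinary and decodeDictionaryIds suggest that the failure happens while dictionary IDs from a data page are being mapped back to values. The sketch below is a simplified Python model of that kind of lookup, not Parquet's or Spark's actual code; the function name, the sample dictionary and the sample IDs are all made up for illustration. It only shows why an ID that is out of range for the dictionary in hand produces an out-of-bounds error. It does not explain how such an ID arises in this issue, which the thread has not established.

```python
# Simplified model of dictionary-encoded column decoding.
# All names and data here are hypothetical; this is not Parquet's real code.

def decode_dictionary_ids(dictionary, ids):
    """Map each dictionary ID stored in a page back to its value."""
    return [dictionary[i] for i in ids]

dictionary = ["alice", "bob", "carol"]  # distinct values, stored once
page_ids = [0, 2, 1, 0]                 # the page stores small integer IDs
decoded = decode_dictionary_ids(dictionary, page_ids)

# An ID that is out of range for this dictionary fails on lookup. This is the
# Python analogue of java.lang.ArrayIndexOutOfBoundsException: 6934.
try:
    decode_dictionary_ids(dictionary, [0, 6934])
    error = None
except IndexError as exc:
    error = exc
```

In this model the lookup itself is trivial, so an error of this kind points at the IDs or the dictionary being handed to the reader rather than at the lookup.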