Cheng Lian created PARQUET-893:
----------------------------------
Summary: GroupColumnIO.getFirst() doesn't check for empty groups
Key: PARQUET-893
URL: https://issues.apache.org/jira/browse/PARQUET-893
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Cheng Lian
The following Spark 2.1 snippet reproduces this issue:
{code}
import org.apache.spark.sql.types._
val path = "/tmp/parquet-test"
case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)
val df = Seq(Outer(Inner(1), 1)).toDF()
df.printSchema()
// root
// |-- f0: struct (nullable = true)
// | |-- f00: integer (nullable = false)
// |-- f1: integer (nullable = false)
df.write.mode("overwrite").parquet(path)
val requestedSchema =
  new StructType().
    add("f0", new StructType().
      // This nested field name differs from the original one
      add("f01", IntegerType)).
    add("f1", IntegerType)
println(requestedSchema.treeString)
// root
// |-- f0: struct (nullable = true)
// | |-- f01: integer (nullable = true)
// |-- f1: integer (nullable = true)
spark.read.schema(requestedSchema).parquet(path).show()
{code}
In the above snippet, {{requestedSchema}} is compatible with the schema of the written Parquet file (the unmatched nested field {{f01}} should simply read as null), but the following exception is thrown:
{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/parquet-test/part-00007-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
    at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
    at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
    at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
    ... 21 more
{noformat}
According to this stack trace, {{GroupColumnIO.getFirst()}} [doesn't check for empty groups|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/io/GroupColumnIO.java#L103] before calling {{get(0)}} on its children list, so a requested group that matches no column in the file fails with an {{IndexOutOfBoundsException}} instead of being handled gracefully.
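For illustration, here is a minimal, self-contained sketch of the failure mode. The {{Node}} class below is hypothetical: it only models the recursion pattern of {{GroupColumnIO.getFirst()}}/{{PrimitiveColumnIO.getFirst()}} and is not the parquet-mr source. A group that ends up with zero matched children crashes on {{children.get(0)}}, while a guarded variant could surface the empty group explicitly:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical minimal model of the ColumnIO tree; names echo
// GroupColumnIO/PrimitiveColumnIO but this is NOT the actual source.
class Node {
    final boolean isGroup;
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name, boolean isGroup, Node... kids) {
        this.name = name;
        this.isGroup = isGroup;
        Collections.addAll(children, kids);
    }

    // Mirrors the unchecked recursion: a group always descends into its
    // first child, so an empty group hits children.get(0) and throws
    // IndexOutOfBoundsException, as in the stack trace above.
    Node getFirstUnchecked() {
        return isGroup ? children.get(0).getFirstUnchecked() : this;
    }

    // A guarded variant: report the empty group explicitly instead of
    // leaking an IndexOutOfBoundsException out of ArrayList.
    Node getFirstChecked() {
        if (!isGroup) {
            return this;
        }
        if (children.isEmpty()) {
            return null; // empty group: there is no first leaf
        }
        return children.get(0).getFirstChecked();
    }
}

public class EmptyGroupDemo {
    public static void main(String[] args) {
        // Models the report: group f0 was requested with a field (f01)
        // that matches nothing in the file, leaving it with no children.
        Node emptyF0 = new Node("f0", true);
        Node root = new Node("root", true, emptyF0);

        try {
            root.getFirstUnchecked();
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked: " + e); // the reported crash
        }
        System.out.println("checked: " + root.getFirstChecked()); // null
    }
}
{code}

Whether the real fix should return null, skip the empty group, or raise a descriptive schema-mismatch error is a design decision for parquet-mr; the sketch only shows where the bounds check is missing.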
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)