RussellSpitzer opened a new issue #2692:
URL: https://github.com/apache/iceberg/issues/2692
When Parquet vectorized reading is enabled and the file uses DELTA_BYTE_ARRAY
encoding, we throw a NullPointerException:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2) (macbook-pro.attlocal.net executor driver): java.lang.NullPointerException
    at org.apache.iceberg.arrow.vectorized.parquet.BaseVectorizedParquetValuesReader.readUnsignedVarInt(BaseVectorizedParquetValuesReader.java:137)
    at org.apache.iceberg.arrow.vectorized.parquet.BaseVectorizedParquetValuesReader.readNextGroup(BaseVectorizedParquetValuesReader.java:187)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader.readBatchVarWidth(VectorizedParquetDefinitionLevelReader.java:714)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.nextBatchVarWidthType(VectorizedPageIterator.java:393)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.nextBatchVarWidthType(VectorizedColumnIterator.java:182)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:148)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:70)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:39)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
    at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:88)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
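For context, a minimal reproduction sketch. The table name is a placeholder and the `vectorization-enabled` read option is assumed here, not taken from the report; the underlying data files were written elsewhere with DELTA_BYTE_ARRAY-encoded BINARY columns:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaByteArrayRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("delta-byte-array-repro")
        .getOrCreate();

    // Hypothetical table name; its data files use DELTA_BYTE_ARRAY for
    // BINARY columns (e.g. written by Trino or via the Java API).
    Dataset<Row> df = spark.read()
        .option("vectorization-enabled", "true") // force the vectorized path
        .format("iceberg")
        .load("db.vector_table");

    df.show(); // triggers the NullPointerException in readUnsignedVarInt
  }
}
```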
This can occur when Parquet files are added through non-Spark frameworks
like Trino, or when files are manually added to a table using the Java API.
Spark's vectorized reader also does not support this encoding, but it throws a
clearer error:
```
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:783)
```
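For comparison, here is a sketch of the kind of fail-fast guard Spark applies in `VectorizedColumnReader.initDataReader`. This is illustrative, not Spark's or Iceberg's actual code, and the set of accepted encodings is an assumption:
```java
import org.apache.parquet.column.Encoding;

public class EncodingGuard {
  // Reject data page encodings the vectorized decoder cannot handle with a
  // descriptive message, rather than letting them reach the values reader
  // and fail with a NullPointerException.
  static void checkDataEncoding(Encoding encoding) {
    switch (encoding) {
      case PLAIN:
      case PLAIN_DICTIONARY:
      case RLE_DICTIONARY:
        return; // encodings assumed to be supported by the batch decoder
      default:
        throw new UnsupportedOperationException(
            "Unsupported encoding: " + encoding);
    }
  }
}
```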
Here is the metadata for one of the files in question:
```
file:        file:/Users/russellspitzer/Temp/vector/data/c7cd9613-1349-466b-90ba-24c98b0e3722.parquet
creator:     null

file schema: table
--------------------------------------------------------------------------------
id:          OPTIONAL BINARY L:STRING R:0 D:1
ts:          OPTIONAL INT64 R:0 D:1
s_id:        OPTIONAL BINARY L:STRING R:0 D:1

row group 1: RC:1 TS:165 OFFSET:4
--------------------------------------------------------------------------------
id:          BINARY GZIP DO:0 FPO:4 SZ:53/35/0.66 VC:1 ENC:DELTA_BYTE_ARRAY ST:[min: 1, max: 1, num_nulls: 0]
ts:          INT64 GZIP DO:0 FPO:57 SZ:54/34/0.63 VC:1 ENC:DELTA_BINARY_PACKED ST:[min: 1619809949087, max: 1619809949087, num_nulls: 0]
s_id:        BINARY GZIP DO:0 FPO:111 SZ:58/40/0.69 VC:1 ENC:DELTA_BYTE_ARRAY ST:[min: 708546, max: 708546, num_nulls: 0]
```
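The ENC field above can also be read programmatically from the file footer. Below is a small sketch (the file path is a placeholder) using the standard parquet-mr API to list each column chunk's encodings, which is a quick way to spot DELTA_BYTE_ARRAY files before handing them to the vectorized reader:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintEncodings {
  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/example.parquet"); // placeholder path
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(path, new Configuration()))) {
      // Walk every row group and print the encodings of each column chunk.
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          System.out.println(column.getPath() + " -> " + column.getEncodings());
        }
      }
    }
  }
}
```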