RussellSpitzer opened a new issue #2692:
URL: https://github.com/apache/iceberg/issues/2692
When Parquet vectorized reading is enabled and the file uses DELTA_BYTE_ARRAY
encoding, we throw a NullPointerException:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2) (macbook-pro.attlocal.net executor driver): java.lang.NullPointerException
    at org.apache.iceberg.arrow.vectorized.parquet.BaseVectorizedParquetValuesReader.readUnsignedVarInt(BaseVectorizedParquetValuesReader.java:137)
    at org.apache.iceberg.arrow.vectorized.parquet.BaseVectorizedParquetValuesReader.readNextGroup(BaseVectorizedParquetValuesReader.java:187)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader.readBatchVarWidth(VectorizedParquetDefinitionLevelReader.java:714)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.nextBatchVarWidthType(VectorizedPageIterator.java:393)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.nextBatchVarWidthType(VectorizedColumnIterator.java:182)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:148)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:70)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:39)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
    at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:88)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
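For context, a minimal reproduction sketch. The table name is a placeholder and the `vectorization-enabled` read option is assumed here, not taken from the report; the underlying data files were written elsewhere with DELTA_BYTE_ARRAY-encoded BINARY columns:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaByteArrayRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("delta-byte-array-repro")
        .getOrCreate();

    // Hypothetical table name; its data files use DELTA_BYTE_ARRAY for
    // BINARY columns (e.g. written by Trino or via the Java API).
    Dataset<Row> df = spark.read()
        .option("vectorization-enabled", "true") // force the vectorized path
        .format("iceberg")
        .load("db.vector_table");

    df.show(); // triggers the NullPointerException in readUnsignedVarInt
  }
}
```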
This can occur when Parquet files are added through non-Spark frameworks
like Trino, or when files are manually added to a table using the Java API.
Spark's vectorized reader also does not support this encoding, but it throws a
clearer error:
```
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:783)
```
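For comparison, here is a sketch of the kind of fail-fast guard Spark applies in `VectorizedColumnReader.initDataReader`. This is illustrative, not Spark's or Iceberg's actual code, and the set of accepted encodings is an assumption:
```java
import org.apache.parquet.column.Encoding;

public class EncodingGuard {
  // Reject data page encodings the vectorized decoder cannot handle with a
  // descriptive message, rather than letting them reach the values reader
  // and fail with a NullPointerException.
  static void checkDataEncoding(Encoding encoding) {
    switch (encoding) {
      case PLAIN:
      case PLAIN_DICTIONARY:
      case RLE_DICTIONARY:
        return; // encodings assumed to be supported by the batch decoder
      default:
        throw new UnsupportedOperationException(
            "Unsupported encoding: " + encoding);
    }
  }
}
```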
Here is the metadata for one of the files in question:
```
file:        file:/Users/russellspitzer/Temp/vector/data/c7cd9613-1349-466b-90ba-24c98b0e3722.parquet
creator:     null

file schema: table
--------------------------------------------------------------------------------
id:          OPTIONAL BINARY L:STRING R:0 D:1
ts:          OPTIONAL INT64 R:0 D:1
s_id:        OPTIONAL BINARY L:STRING R:0 D:1

row group 1: RC:1 TS:165 OFFSET:4
--------------------------------------------------------------------------------
id:          BINARY GZIP DO:0 FPO:4 SZ:53/35/0.66 VC:1 ENC:DELTA_BYTE_ARRAY ST:[min: 1, max: 1, num_nulls: 0]
ts:          INT64 GZIP DO:0 FPO:57 SZ:54/34/0.63 VC:1 ENC:DELTA_BINARY_PACKED ST:[min: 1619809949087, max: 1619809949087, num_nulls: 0]
s_id:        BINARY GZIP DO:0 FPO:111 SZ:58/40/0.69 VC:1 ENC:DELTA_BYTE_ARRAY ST:[min: 708546, max: 708546, num_nulls: 0]
```
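The ENC field above can also be read programmatically from the file footer. Below is a small sketch (the file path is a placeholder) using the standard parquet-mr API to list each column chunk's encodings, which is a quick way to spot DELTA_BYTE_ARRAY files before handing them to the vectorized reader:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintEncodings {
  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/example.parquet"); // placeholder path
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(path, new Configuration()))) {
      // Walk every row group and print the encodings of each column chunk.
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          System.out.println(column.getPath() + " -> " + column.getEncodings());
        }
      }
    }
  }
}
```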