[ https://issues.apache.org/jira/browse/SPARK-50457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902206#comment-17902206 ]
Raunaq Morarka commented on SPARK-50457:
----------------------------------------
[~sunchao] [~dennishuo] could you please take a look at this?
> Failure when reading a parquet file with DELTA_LENGTH_BYTE_ARRAY encoding
> -------------------------------------------------------------------------
>
> Key: SPARK-50457
> URL: https://issues.apache.org/jira/browse/SPARK-50457
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2
> Reporter: Raunaq Morarka
> Priority: Major
> Attachments:
> 20241129_061036_00012_bfrcc_7f44c8f0-479e-4d6a-8c77-a28e88af8db7
>
>
> Encountered the failure below when trying to read a Parquet file that uses the DELTA_LENGTH_BYTE_ARRAY encoding for BINARY values:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Failed to read 24 bytes
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedDeltaLengthByteArrayReader.readBinary(VectorizedDeltaLengthByteArrayReader.java:63)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$BinaryUpdater.readValues(ParquetVectorUpdaterFactory.java:729)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatchInternal(VectorizedRleValuesReader.java:244)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatch(VectorizedRleValuesReader.java:176)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:252)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:328)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:219)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
> {code}
> DDL
> {code:java}
> CREATE TABLE default.lineitem (
>   l_orderkey BIGINT,
>   l_partkey BIGINT,
>   l_suppkey BIGINT,
>   l_linenumber INT,
>   l_quantity DECIMAL(12,2),
>   l_extendedprice DECIMAL(12,2),
>   l_discount DECIMAL(12,2),
>   l_tax DECIMAL(12,2),
>   l_returnflag VARCHAR(1),
>   l_linestatus VARCHAR(1),
>   l_shipdate DATE,
>   l_commitdate DATE,
>   l_receiptdate DATE,
>   l_shipinstruct VARCHAR(25),
>   l_shipmode VARCHAR(10),
>   l_comment VARCHAR(44))
> USING parquet {code}
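> The VARCHAR columns above are stored as Parquet BINARY, which is where the problematic encoding shows up. As a sanity check, the footer metadata lists the encodings used by each column chunk; a minimal sketch against the parquet-java API (file path again a placeholder):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.BlockMetaData;
> import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
> import org.apache.parquet.hadoop.util.HadoopInputFile;
>
> public class PrintEncodings {
>   public static void main(String[] args) throws Exception {
>     // Print the encodings recorded in the footer for every column chunk,
>     // to confirm which columns actually use DELTA_LENGTH_BYTE_ARRAY.
>     Path path = new Path("/tmp/delta_length.parquet");
>     try (ParquetFileReader reader = ParquetFileReader.open(
>         HadoopInputFile.fromPath(path, new Configuration()))) {
>       ParquetMetadata footer = reader.getFooter();
>       for (BlockMetaData block : footer.getBlocks()) {
>         for (ColumnChunkMetaData column : block.getColumns()) {
>           System.out.println(column.getPath() + " -> " + column.getEncodings());
>         }
>       }
>     }
>   }
> }
> {code}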
> The same file was read successfully by Apache Hive, Trino, and parquet-cli.
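> To double-check this from the Java side as well, the file can be read row-by-row with parquet-java's example API outside Spark; a sketch (placeholder path) that should succeed, given that parquet-mr ships its own decoder for DELTA_LENGTH_BYTE_ARRAY:
> {code:java}
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.example.data.Group;
> import org.apache.parquet.hadoop.ParquetReader;
> import org.apache.parquet.hadoop.example.GroupReadSupport;
>
> public class DirectRead {
>   public static void main(String[] args) throws Exception {
>     // Iterate over the whole file with the row-based parquet-mr reader;
>     // a clean pass here mirrors the Hive/Trino/parquet-cli results.
>     try (ParquetReader<Group> reader = ParquetReader
>         .builder(new GroupReadSupport(), new Path("/tmp/delta_length.parquet"))
>         .build()) {
>       long count = 0;
>       for (Group g = reader.read(); g != null; g = reader.read()) {
>         count++;
>       }
>       System.out.println("rows read: " + count);
>     }
>   }
> }
> {code}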