陈磊 created SPARK-54863:
--------------------------
Summary: Parquet file written by Spark 3.4 is damaged and
no exception is logged
Key: SPARK-54863
URL: https://issues.apache.org/jira/browse/SPARK-54863
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 3.4.0
Reporter: 陈磊
This week there has been one instance of a damaged Parquet file when using
SparkSQL 3.4 to write Parquet.
The first anomaly found was that business data had been written into the
position that holds the length of the field data, which makes the data
unreadable.
In a normal record, the four length bytes of the 'created_tm' field are
[19,0,0,0].
In the damaged record, the four length bytes of the 'created_tm' field are
[19,0,50,48].
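To make the effect of the two extra bytes concrete, here is a minimal sketch that decodes both prefixes. It relies on the Parquet PLAIN encoding for BYTE_ARRAY, which stores each value as a 4-byte little-endian length followed by the bytes; the helper name `plain_byte_array_length` is made up for this illustration and is not part of any Parquet library.

```python
import struct


def plain_byte_array_length(prefix: bytes) -> int:
    """Decode the 4-byte little-endian length prefix of a PLAIN BYTE_ARRAY value.

    Hypothetical helper for illustration only.
    """
    if len(prefix) != 4:
        raise ValueError("expected exactly 4 bytes")
    (length,) = struct.unpack("<i", prefix)
    return length


normal = bytes([19, 0, 0, 0])     # prefix of a normal record
damaged = bytes([19, 0, 50, 48])  # prefix of the damaged record

print(plain_byte_array_length(normal))   # 19
print(plain_byte_array_length(damaged))  # 808583187
print(damaged[2:].decode("ascii"))       # 20
```

The damaged prefix decodes to a length of about 808 million bytes, far more than a data page can hold. The two high bytes, 50 and 48, are the ASCII codes for '2' and '0', which appears consistent with text data having overwritten the upper half of the length field, although this alone does not show where those bytes came from.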
The following exception is thrown during execution:
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1771814 in block 4 in file hdfs://ns22034/user/dd_edw/odm.db/odm_jdr_sch_d03_sku_pop_sku_pk_act_da/dp=5883/00000002-628c-4456-a893-f9ea6b31122b-0_11164-0-11164_20251221141817486.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
    ... 32 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [created_tm] optional binary created_tm (STRING) at value 267595 out of 716627, 7595 out of 20000 in currentPage. repetition level: 0, definition level: 1
    at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:553)
    at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
    at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:439)
    at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)
    ... 34 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 174666
    at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:42)
    at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:372)
    at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:533)
    ... 39 more
Caused by: java.io.EOFException
    at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
    at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:40)
    ... 41 more
{code}
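The EOFException at the bottom of the trace is what one would expect when a PLAIN-encoded BYTE_ARRAY length prefix claims more bytes than remain in the page. The sketch below illustrates that failure mode on a hand-built buffer. It assumes the input is the raw, already decompressed value bytes of a PLAIN page (length prefix, then payload, repeated); the function name `find_first_bad_length` and the sample payloads are invented for this illustration, and the code is not how parquet-mr itself reads pages.

```python
import struct


def find_first_bad_length(values: bytes):
    """Walk 4-byte-length-prefixed values in a buffer.

    Returns (value_index, byte_offset, declared_length) for the first value
    whose prefix is truncated, negative, or runs past the end of the buffer.
    Returns None if every value fits. Hypothetical helper for illustration.
    """
    offset = 0
    index = 0
    while offset < len(values):
        if offset + 4 > len(values):
            # Not enough bytes left to hold a length prefix.
            return (index, offset, None)
        (length,) = struct.unpack_from("<i", values, offset)
        start = offset + 4
        if length < 0 or start + length > len(values):
            return (index, offset, length)
        offset = start + length
        index += 1
    return None


# Made-up sample data: 19-byte payloads, as in the report.
good = struct.pack("<i", 19) + b"x" * 19
damaged = bytes([19, 0, 50, 48]) + b"x" * 19

print(find_first_bad_length(good + good))            # None
print(find_first_bad_length(good + good + damaged))  # (2, 46, 808583187)
```

In the second call the third value starts at byte offset 46 and declares a length of 808583187, so any reader that trusts the prefix would try to slice past the end of the buffer.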