陈磊 created SPARK-54863:
--------------------------

             Summary: A Parquet file written by Spark 3.4 is corrupted and 
no exception is logged
                 Key: SPARK-54863
                 URL: https://issues.apache.org/jira/browse/SPARK-54863
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 3.4.0
            Reporter: 陈磊


This week there has been one instance of a corrupted Parquet file when using 
SparkSQL 3.4 to write Parquet.

The failure is that business data was incorrectly written into the bytes that 
encode the length of the field data, so reading that field fails.

For a normal record, the four length-prefix bytes of the 'created_tm' field 
are [19, 0, 0, 0].

For the damaged record, the four length-prefix bytes of the 'created_tm' 
field are [19, 0, 50, 48].

The following exception is thrown when reading the file:

{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 1771814 in block 4 in file 
hdfs://ns22034/user/dd_edw/odm.db/odm_jdr_sch_d03_sku_pop_sku_pk_act_da/dp=5883/00000002-628c-4456-a893-f9ea6b31122b-0_11164-0-11164_20251221141817486.parquet
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
        at 
org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
        ... 32 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in 
column [created_tm] optional binary created_tm (STRING) at value 267595 out of 
716627, 7595 out of 20000 in currentPage. repetition level: 0, definition 
level: 1
        at 
org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:553)
        at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
        at 
org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:439)
        at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
        at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)
        ... 34 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes 
at offset 174666
        at 
org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:42)
        at 
org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:372)
        at 
org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:533)
        ... 39 more
Caused by: java.io.EOFException
        at 
org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
        at 
org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:40)
        ... 41 more
{code}
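For context on why the damaged bytes lead to the EOFException above: Parquet's PLAIN encoding for a BYTE_ARRAY value stores a 4-byte little-endian length prefix followed by the value bytes, and that prefix is what BinaryPlainValuesReader reads before slicing the page buffer. A minimal sketch (the class name LengthPrefixDemo is hypothetical, not from Parquet) decoding the two prefixes reported above shows how large the corrupted length is. Notably, bytes 50 and 48 are the ASCII codes for '2' and '0', consistent with business data (e.g. a "20..." timestamp string) having overwritten the length field:

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LengthPrefixDemo {
    // Decode a PLAIN-encoded BYTE_ARRAY length prefix (4 bytes, little-endian).
    static int decodeLength(byte[] prefix) {
        return ByteBuffer.wrap(prefix).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // Normal record: [19, 0, 0, 0] -> 19 bytes, a plausible timestamp string length.
        System.out.println(decodeLength(new byte[]{19, 0, 0, 0}));    // 19
        // Damaged record: [19, 0, 50, 48] -> ~808 MB, far beyond the page,
        // so SingleBufferInputStream.slice() hits end-of-buffer and throws EOFException.
        System.out.println(decodeLength(new byte[]{19, 0, 50, 48}));  // 808583187
    }
}
{code}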




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
