[
https://issues.apache.org/jira/browse/SPARK-54863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
陈磊 updated SPARK-54863:
-----------------------
Attachment: zhengchang.png
> The Parquet file written by Spark 3.4 is damaged and there are no exception
> logs
> --------------------------------------------------------------------------------
>
> Key: SPARK-54863
> URL: https://issues.apache.org/jira/browse/SPARK-54863
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.4.0
> Reporter: 陈磊
> Priority: Major
> Attachments: zhengchang.png, zhengchangbu.png
>
>
> This week there was one instance of a damaged Parquet file when using
> Spark SQL 3.4 to write Parquet.
> The first anomaly found is that business data was incorrectly written into the
> position that stores the length of the field data, which makes the data
> unreadable.
> In a normal record, the 'created_tm' field has four length bytes, [19, 0, 0, 0].
> In the damaged record, the 'created_tm' field's four 'length' bytes are
> [19, 0, 50, 48].
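> For illustration (not from the original report): assuming the page uses PLAIN
> encoding, which the BinaryPlainValuesReader in the stack trace below suggests,
> each BYTE_ARRAY value is stored as a 4-byte little-endian length followed by
> the raw bytes, so the two byte patterns above decode to very different lengths.
> A minimal sketch:
> {code:java}
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
>
> public class LengthPrefixCheck {
>     // PLAIN-encoded BYTE_ARRAY values carry a 4-byte little-endian length prefix.
>     static int decodeLength(byte[] prefix) {
>         return ByteBuffer.wrap(prefix).order(ByteOrder.LITTLE_ENDIAN).getInt();
>     }
>
>     public static void main(String[] args) {
>         byte[] good = {19, 0, 0, 0};   // healthy record
>         byte[] bad  = {19, 0, 50, 48}; // damaged record from this report
>         System.out.println(decodeLength(good)); // 19
>         System.out.println(decodeLength(bad));  // 808583187, far past the page boundary
>     }
> }
> {code}
> Bytes 50 and 48 are the ASCII codes for '2' and '0', which would be consistent
> with field data overwriting the length prefix; a length in the hundreds of
> millions would explain the EOFException in SingleBufferInputStream.slice below.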
> The following exception is thrown when the file is read:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1771814 in block 4 in file hdfs://ns22034/user/dd_edw/odm.db/odm_jdr_sch_d03_sku_pop_sku_pk_act_da/dp=5883/00000002-628c-4456-a893-f9ea6b31122b-0_11164-0-11164_20251221141817486.parquet
>     at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
>     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
>     at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
>     ... 32 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [created_tm] optional binary created_tm (STRING) at value 267595 out of 716627, 7595 out of 20000 in currentPage. repetition level: 0, definition level: 1
>     at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:553)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
>     at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:439)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>     at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>     at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)
>     ... 34 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 174666
>     at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:42)
>     at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:372)
>     at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:533)
>     ... 39 more
> Caused by: java.io.EOFException
>     at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
>     at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:40)
>     ... 41 more
> {code}
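> Not part of the report, but a minimal sketch (class name and argument handling
> are hypothetical) of how the suspect file can be scanned directly with
> parquet-avro, independent of Hudi/Spark, to find the first unreadable row,
> assuming the parquet-avro and hadoop-client artifacts are on the classpath:
> {code:java}
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetReader;
> import org.apache.parquet.hadoop.ParquetReader;
>
> public class ScanSuspectFile {
>     public static void main(String[] args) throws Exception {
>         long rows = 0;
>         // args[0]: HDFS or local path of the suspect Parquet file
>         try (ParquetReader<GenericRecord> reader =
>                  AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
>             while (reader.read() != null) {
>                 rows++;
>             }
>             System.out.println("Read " + rows + " rows without error");
>         } catch (Exception e) {
>             System.err.println("Decoding failed after " + rows + " rows: " + e);
>         }
>     }
> }
> {code}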