mzheng-plaid commented on issue #11708:
URL: https://github.com/apache/hudi/issues/11708#issuecomment-2263671911
Yes, the Parquet file itself is corrupted. Reading it with `pqrs`
segfaults:
```
❯ RUST_BACKTRACE=1 pqrs cat ./xxx.parquet --json | jq '.model_output' | sort | uniq -c
```
Reading it with `spark.read.format("parquet").load("xxx.parquet")` also
fails, as expected. This happens regardless of
`spark.sql.parquet.enableVectorizedReader`; I also tried with it set to
`false`:
```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [foo] optional float foo at value 204757 out of 463825, 4757 out of 20000 in currentPage. repetition level: 0, definition level: 1
    at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:553)
    at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
    at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:439)
    at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:229)
    ... 19 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
```
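For anyone triaging a similar file, a cheap first check is whether the file is
merely truncated. Below is a minimal sketch using only the Python standard
library; it is not something I ran against this file, and the function name
`check_parquet_framing` is made up for illustration. It relies only on the
Parquet file layout: a leading `PAR1` magic, then at the end a 4-byte
little-endian footer length followed by a trailing `PAR1` magic. A clean result
does not mean the data pages are intact. Page-level corruption like the one in
the trace above only shows up when the column is actually decoded.

```python
import os
import struct
import sys

MAGIC = b"PAR1"  # 4-byte magic at the start and end of an unencrypted Parquet file


def check_parquet_framing(path):
    """Return a list of framing problems; an empty list means none were found.

    Hypothetical helper: it only inspects the first 4 and last 8 bytes,
    so it can detect truncation but not corrupted data pages.
    """
    size = os.path.getsize(path)
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if size < 12:
        return [f"file is only {size} bytes, too small to be Parquet"]

    problems = []
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            problems.append("missing leading PAR1 magic")
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)

    if tail[4:] != MAGIC:
        problems.append("missing trailing PAR1 magic")
    else:
        # The footer length is only meaningful if the trailing magic is present.
        footer_len = struct.unpack("<I", tail[:4])[0]
        if footer_len > size - 12:
            problems.append(
                f"footer length {footer_len} does not fit in a {size}-byte file"
            )
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    issues = check_parquet_framing(sys.argv[1])
    print("\n".join(issues) if issues else "framing looks intact")
```

If this reports no problems, the file was most likely written out completely
and the damage is inside a column chunk, which matches a reader getting
partway through a page before failing.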