[PR] Add golden file tests for vectorized Parquet reads [iceberg]

via GitHub Wed, 02 Jul 2025 13:59:08 -0700


eric-maynard opened a new pull request, #13450:
URL: https://github.com/apache/iceberg/pull/13450

While implementing support for new Parquet encodings (e.g. #13391), I
noticed that our tests rely on generating Parquet data at test time. For some
encodings, such as DELTA_BYTE_ARRAY, this is a problem because there is no
reliable way to tell the writer to use a particular encoding for a particular
field.

To address this gap, this PR introduces a new test `testGoldenFiles` along
with several pre-generated Parquet files written using various encodings. I
intend to add more files/encodings here as support for new encodings is
introduced.
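
As a rough illustration of how such pre-generated files could be enumerated, here is a minimal Python sketch. It assumes a layout of `<root>/<ENCODING>/<type>.parquet`, inferred only from the single example path quoted in this PR description; the helper name `discover_golden_files` is hypothetical, and this is not how `testGoldenFiles` itself is implemented.

```python
from pathlib import Path


def discover_golden_files(root):
    """Group golden files by encoding, assuming <root>/<ENCODING>/<type>.parquet.

    The directory layout is an assumption made for illustration; only
    files with a .parquet suffix one level below root are considered.
    """
    cases = {}
    for path in sorted(Path(root).glob("*/*.parquet")):
        encoding = path.parent.name   # directory name, e.g. "RLE"
        column_type = path.stem       # file name without suffix, e.g. "int32"
        cases.setdefault(encoding, []).append(column_type)
    return cases
```

Keying the result by encoding means that adding a file for a new encoding would extend the set of cases without touching the enumeration logic.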

I generated these files using [this small
util](https://github.com/eric-maynard/parquet-golden-files) and manually
validated the encodings with `parquet-tools`, e.g.:

```
$ parquet-tools inspect --detail \
    ~/iceberg/spark/v4.0/spark/src/test/resources/encodings/RLE/int32.parquet
FileMetaData
. . .
    encodings = list
        0
        3
        8
. . .
```
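
For readers unfamiliar with the numeric values in that output, the sketch below maps them to names. It assumes the integers are the raw `Encoding` enum values from the Parquet Thrift definition (`parquet.thrift`); the helper name `encoding_names` is hypothetical.

```python
# Encoding IDs as defined by the Encoding enum in parquet.thrift.
# ID 1 is intentionally absent: it is not assigned in that enum.
PARQUET_ENCODINGS = {
    0: "PLAIN",
    2: "PLAIN_DICTIONARY",
    3: "RLE",
    4: "BIT_PACKED",
    5: "DELTA_BINARY_PACKED",
    6: "DELTA_LENGTH_BYTE_ARRAY",
    7: "DELTA_BYTE_ARRAY",
    8: "RLE_DICTIONARY",
    9: "BYTE_STREAM_SPLIT",
}


def encoding_names(ids):
    """Translate numeric encoding IDs into names, flagging unknown IDs."""
    return [PARQUET_ENCODINGS.get(i, f"UNKNOWN({i})") for i in ids]


print(encoding_names([0, 3, 8]))
# prints ['PLAIN', 'RLE', 'RLE_DICTIONARY']
```

Under that assumption, the `0`, `3`, `8` shown above correspond to PLAIN, RLE, and RLE_DICTIONARY.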

