eric-maynard opened a new pull request, #13450: URL: https://github.com/apache/iceberg/pull/13450
While implementing new Parquet encodings (e.g. #13391), I noticed that we rely on generating Parquet data at test time. For some encodings, such as DELTA_BYTE_ARRAY, this is complicated by the fact that there is no reliable way to tell the writer to use a particular encoding for a particular field.

To address this gap, this PR introduces a new test, `testGoldenFiles`, along with several pre-generated Parquet files written using various encodings. I intend to add more files and encodings here as support for new encodings is introduced.

I generated these files using [this small util](https://github.com/eric-maynard/parquet-golden-files) and manually validated the encodings with `parquet-tools`, e.g.:

```
$ parquet-tools inspect --detail ~/iceberg/spark/v4.0/spark/src/test/resources/encodings/RLE/int32.parquet
FileMetaData
. . .
                        encodings = list
                            0
                            3
                            8
. . .
```
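For reference when reading the `parquet-tools` output above: the integers in the `encodings` list are values of the `Encoding` enum defined in the Parquet format's `parquet.thrift`. The snippet below is only a minimal sketch for translating them into names. It is not part of this PR or of the linked util, and the helper name `decode_encodings` is hypothetical.

```python
# Encoding enum values as defined in the Parquet format (parquet.thrift).
# Value 1 is intentionally absent: it was never assigned to a released encoding.
PARQUET_ENCODINGS = {
    0: "PLAIN",
    2: "PLAIN_DICTIONARY",
    3: "RLE",
    4: "BIT_PACKED",
    5: "DELTA_BINARY_PACKED",
    6: "DELTA_LENGTH_BYTE_ARRAY",
    7: "DELTA_BYTE_ARRAY",
    8: "RLE_DICTIONARY",
    9: "BYTE_STREAM_SPLIT",
}


def decode_encodings(ids):
    """Hypothetical helper: map encoding ids from the footer to readable names."""
    return [PARQUET_ENCODINGS.get(i, f"UNKNOWN({i})") for i in ids]


# The ids listed for encodings/RLE/int32.parquet in the output above.
print(decode_encodings([0, 3, 8]))  # ['PLAIN', 'RLE', 'RLE_DICTIONARY']
```

Under this mapping, the `0`, `3`, `8` shown for `encodings/RLE/int32.parquet` correspond to PLAIN, RLE, and RLE_DICTIONARY.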
