jbewing opened a new pull request, #14852: URL: https://github.com/apache/iceberg/pull/14852
### What

This PR adds vectorized read support to Iceberg for the [BYTE_STREAM_SPLIT encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#byte-stream-split-byte_stream_split--9) from the Apache Parquet v2 specification (see https://github.com/apache/iceberg/issues/7162). It builds on the existing support for reading [DELTA_BINARY_PACKED](https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-encoding-delta_binary_packed--5) implemented by @eric-maynard in https://github.com/apache/iceberg/pull/13391 by:

- Implementing vectorized read support for the [BYTE_STREAM_SPLIT encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#byte-stream-split-byte_stream_split--9)
- Bolstering golden-file test coverage to cover each of the paths above. I also added golden-file tests that include rows with null values for each data type, to ensure our handling of nulls is correct

This is split out from https://github.com/apache/iceberg/pull/14800 to make the changes easier to review and to enable tighter feedback cycles, per feedback on the [Iceberg Developer Slack Community](https://apache-iceberg.slack.com/archives/C03LG1D563F/p1765466132290829).

### Background

This addresses a longstanding issue: the reference Apache Iceberg Spark implementation, with its default settings (e.g. `spark.sql.iceberg.vectorization.enabled` = `true`), cannot read Iceberg tables written by other compute engines that use the Apache Parquet v2 writer specification. The widely known workarounds are to disable the vectorized reader in Spark when you need to interoperate with other compute engines, or to configure every compute engine to write Parquet files with the Apache Parquet v1 writer specification. Disabling the vectorization flag makes clients take a performance hit that we've anecdotally measured to be _quite_ large for some workloads. Forcing all writers of an Iceberg table to use the Apache Parquet v1 format incurs additional performance and storage penalties: files written with Parquet v2 tend to be smaller than those written with the v1 spec, since the newer encodings tend to save space, and they are often faster to read and write. So the current setup is a lose-lose for performance and data size, with the additional papercut that Apache Iceberg is not very portable across engines in its default configuration. This PR seeks to solve that by finishing the work of implementing vectorized Parquet read support for the v2 format.

In the future, we may also consider allowing clients to write Apache Parquet v2 files natively, gated by an Apache Iceberg setting. Even further down that road, we may consider making that the default.

### Previous Work / Thanks

This PR is a revival and extension of the work @eric-maynard was doing in https://github.com/apache/iceberg/pull/13709. That PR had been dormant for a little while, so I started from where Eric left off. Thank you for the great work here @eric-maynard, you made implementing the rest of the changes required for vectorized read support _way_ easier!

### Note to Reviewers

This is a split of the `BYTE_STREAM_SPLIT` encoding support from https://github.com/apache/iceberg/pull/14800.

### Testing

I've tested this on a fork of Spark 3.5 and Iceberg 1.10.0 and verified that a Spark job can read a table written with the Parquet v2 writer without issues.
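For concreteness, a minimal sketch of the kind of read check I ran. The catalog and table names are placeholders, and the config key is the one mentioned in the Background section (it is already the default; it's set explicitly here only to make the point):

```java
import org.apache.spark.sql.SparkSession;

public class ByteStreamSplitReadCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("byte-stream-split-read-check")
        .getOrCreate();

    // Vectorized Iceberg reads are on by default; set explicitly for clarity.
    spark.conf().set("spark.sql.iceberg.vectorization.enabled", "true");

    // Before this change, reading data files that use BYTE_STREAM_SPLIT on the
    // vectorized path would fail. `my_catalog.db.parquet_v2_table` is a placeholder
    // for a table whose files were written with the Parquet v2 writer.
    spark.table("my_catalog.db.parquet_v2_table").show();
  }
}
```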
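For reviewers less familiar with BYTE_STREAM_SPLIT: the encoding stores byte `k` of every value in its own contiguous stream, so a decoder re-interleaves one byte from each stream per value. The snippet below only illustrates that layout for 4-byte floats; it is not the vectorized reader code added in this PR:

```java
// Illustrative only: decode a BYTE_STREAM_SPLIT buffer of 4-byte floats.
// The encoded buffer holds 4 streams of `valueCount` bytes each; stream k
// contains byte k (little-endian) of every value.
public class ByteStreamSplitSketch {
  static float[] decodeFloats(byte[] encoded, int valueCount) {
    float[] out = new float[valueCount];
    for (int i = 0; i < valueCount; i++) {
      int bits = (encoded[i] & 0xFF)                      // byte 0 (least significant)
          | (encoded[valueCount + i] & 0xFF) << 8         // byte 1
          | (encoded[2 * valueCount + i] & 0xFF) << 16    // byte 2
          | (encoded[3 * valueCount + i] & 0xFF) << 24;   // byte 3
      out[i] = Float.intBitsToFloat(bits);
    }
    return out;
  }
}
```

Grouping like-positioned bytes together is what makes BYTE_STREAM_SPLIT pages compress so well for floating-point data.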
Issue: https://github.com/apache/iceberg/issues/7162
Split out from: https://github.com/apache/iceberg/pull/14800

cc @nastra
