jbewing opened a new pull request, #14852:
URL: https://github.com/apache/iceberg/pull/14852

   ### What
   
   This PR adds vectorized read support to Iceberg for [BYTE_STREAM_SPLIT 
encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#byte-stream-split-byte_stream_split--9)
 of the Apache Parquet v2 specification (see 
https://github.com/apache/iceberg/issues/7162). This builds on top of the 
existing support for reading 
[DELTA_BINARY_PACKED](https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-encoding-delta_binary_packed--5)
 implemented by @eric-maynard in https://github.com/apache/iceberg/pull/13391 
with:
   - Implementing vectorized read support for [BYTE_STREAM_SPLIT 
encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/#byte-stream-split-byte_stream_split--9)
    - Bolstering golden-file test coverage to cover each of the paths above. 
In addition, I added golden-file tests that include rows with null values for 
each data type to ensure our handling of those is correct
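   For reviewers unfamiliar with the encoding: BYTE_STREAM_SPLIT stores byte `j` of every fixed-width value in its own contiguous stream, which makes the streams more compressible and amenable to SIMD decoding. The round-trip below is a minimal illustrative sketch only (the class and method names are hypothetical; the actual reader in this PR decodes into Arrow vectors, not plain arrays):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative sketch of BYTE_STREAM_SPLIT for FLOAT (4-byte) values.
// Hypothetical helper names; not the actual Iceberg reader code.
public class ByteStreamSplitSketch {

  // Encode: scatter byte j of value i into stream j at offset i.
  static byte[] encode(float[] values) {
    int n = values.length;
    byte[] raw = new byte[n * 4];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(values);
    byte[] out = new byte[n * 4];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < 4; j++) {
        out[j * n + i] = raw[i * 4 + j];
      }
    }
    return out;
  }

  // Decode: gather byte i of each of the 4 streams back into value i.
  static float[] decode(byte[] encoded, int n) {
    byte[] raw = new byte[n * 4];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < 4; j++) {
        raw[i * 4 + j] = encoded[j * n + i];
      }
    }
    float[] values = new float[n];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(values);
    return values;
  }

  public static void main(String[] args) {
    float[] in = {1.0f, 2.5f, -3.25f};
    float[] out = decode(encode(in), in.length);
    assert Arrays.equals(in, out) : "round-trip failed";
    System.out.println(Arrays.toString(out));
  }
}
```

   The same split/gather applies to DOUBLE with 8 streams instead of 4; the vectorized reader's job is to do the gather efficiently in batches.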
   
   This is split out from https://github.com/apache/iceberg/pull/14800 to make 
the changes easier to review and to facilitate tighter feedback cycles, per 
feedback on the [Iceberg Developer Slack 
Community](https://apache-iceberg.slack.com/archives/C03LG1D563F/p1765466132290829).
   
   ### Background
   
   This solves a longstanding issue: the reference Apache Iceberg Spark 
implementation with its default settings (e.g. 
`spark.sql.iceberg.vectorization.enabled` = `true`) isn't able to read Iceberg 
tables that were written by other compute engines using the Apache Parquet v2 
writer specification. The widely known workarounds are to disable the 
vectorized reader in Spark when interoperating with other compute engines, or 
to configure all compute engines to write Parquet files with the v1 writer 
specification. Disabling the vectorization flag imposes a performance hit that 
we've anecdotally measured to be _quite_ large in some cases/workloads. Forcing 
all writers of an Iceberg table to use Parquet v1 incurs additional performance 
and storage penalties, since files written with the v2 spec tend to be smaller 
than those written with v1 (the newer encodings tend to save space) and are 
often faster to read/write. So the current setup is a lose-lose for performance 
and data size, with the additional papercut that Apache Iceberg is not very 
portable across engines in its default configuration. This PR seeks to solve 
that by finishing the swing on implementing vectorized Parquet read support for 
the v2 format. In the future, we may also consider allowing clients to write 
Apache Parquet v2 files natively, gated via a setting in Apache Iceberg. Even 
longer down that road, we may consider making that the default setting.
   
   ### Previous Work / Thanks
   
   This PR is a revival and extension of the work that @eric-maynard was doing in 
https://github.com/apache/iceberg/pull/13709. That PR had been inactive for a 
little while, so I picked up from where Eric left off. Thank you for 
the great work here @eric-maynard, you made implementing the rest of the 
changes required for vectorized read support _way_ easier!
   
   
   ### Note to Reviewers
   
   This is a split of the `BYTE_STREAM_SPLIT` encoding support from 
https://github.com/apache/iceberg/pull/14800.
   
   ### Testing
   I've tested this on a fork of Spark 3.5 & Iceberg 1.10.0 and verified that a 
Spark job can read a table written with the Parquet v2 writer without issues.
   
   
   Issue: https://github.com/apache/iceberg/issues/7162
   Split out from: https://github.com/apache/iceberg/pull/14800
   
   cc @nastra 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
