Proposal: Parquet footer size in Iceberg metadata

Sreeram Garlapati Tue, 21 Jan 2025 12:17:28 -0800

Hello Team!

This is a small improvement proposal to store the *Parquet footer size* as
part of the *data_file* metadata in the Iceberg manifest
<https://iceberg.apache.org/spec/#manifests>:
*manifest_entry   >   (2) data_file  >  (146 Optional) footer_size_in_bytes*


*Motivation*:

   - We have several sub-second read use cases on Iceberg tables. We store
   the Iceberg metadata and Parquet files on S3. Every hop to S3 is very
   expensive (P99 of >200 milliseconds), so we are trying to cut down on
   these hops wherever we can. One such hop occurs during the Parquet file
   read: the first read of the file fetches only the last 8 bytes, to get
   the footer size and the PAR1 magic sequence.
   - Iceberg metadata already includes the file_size_in_bytes. Including
   the footer size benefits all readers, i.e., readers can directly issue 1
   I/O call to read the footer - *read_parquet_footer(filehandle,
   offset=file_size_in_bytes-footer_size_in_bytes-8)* (taking
   footer_size_in_bytes to exclude the 8-byte trailer that holds the 4-byte
   footer length and the PAR1 magic).
   - This is similar to what the Iceberg specification already does for
   table statistics
   <https://iceberg.apache.org/spec/#table-statistics>, where Puffin
   statistics files record *file-footer-size-in-bytes*.
   - This can be easily extended to ORC as needed too. Perhaps, in the ORC
   case, an additional property to store the postscript length is also needed.
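
To make the saved hop concrete, here is a minimal Python sketch that compares the usual two-read footer lookup with a single ranged read driven by the proposed field. It assumes footer_size_in_bytes counts only the footer metadata, not the 8-byte trailer. CountingReader, read_range and make_fake_file are hypothetical helpers for illustration, not part of any Iceberg or Parquet API, and the "file" is an in-memory stand-in rather than a real Parquet file.

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


class CountingReader:
    """Hypothetical stand-in for an object-store client; counts ranged reads."""

    def __init__(self, data: bytes):
        self.data = data
        self.calls = 0

    def read_range(self, offset: int, length: int) -> bytes:
        self.calls += 1
        return self.data[offset:offset + length]


def read_footer_two_hops(reader, file_size: int) -> bytes:
    """Today's path: read the trailer first, then the footer it points to."""
    trailer = reader.read_range(file_size - TRAILER_LEN, TRAILER_LEN)
    (footer_len,) = struct.unpack("<I", trailer[:4])
    if trailer[4:] != MAGIC:
        raise ValueError("not a Parquet file: bad magic")
    return reader.read_range(file_size - TRAILER_LEN - footer_len, footer_len)


def read_footer_one_hop(reader, file_size: int, footer_size: int) -> bytes:
    """Proposed path: footer_size comes from the manifest, so one read suffices."""
    buf = reader.read_range(file_size - footer_size - TRAILER_LEN,
                            footer_size + TRAILER_LEN)
    footer, trailer = buf[:footer_size], buf[footer_size:]
    (footer_len,) = struct.unpack("<I", trailer[:4])
    # Cross-check the manifest value against the file's own trailer.
    if trailer[4:] != MAGIC or footer_len != footer_size:
        raise ValueError("manifest footer size does not match the file")
    return footer


def make_fake_file(footer: bytes) -> bytes:
    """Build bytes laid out like a Parquet file: magic, data, footer, trailer."""
    return MAGIC + b"row-group-bytes" + footer + struct.pack("<I", len(footer)) + MAGIC
```

Reading the trailer together with the footer lets the reader validate the manifest value at no extra I/O cost; a mismatch can fall back to the two-read path.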

Truly appreciate your thoughts,
Sreeram <https://www.linkedin.com/in/sreeramgarlapati>
