Parquet Sync Notes July 31st 2024

Julien Le Dem Wed, 31 Jul 2024 15:46:57 -0700

Attendees:

   - Micah: Google, no special topic today
   - Alkis: Databricks, storage stack. Topic: Parquet extension PR, so that we
     can get it into the format. Wants to fix the metadata to make it work for
     wide schemas.
   - Vinoo: Palantir -> startup in the data space. Working on improving the
     website.
   - Julien: Datadog. Topic: make it possible to read Parquet sequentially
     (as opposed to footer first)
   - Rok: Voltron -> freelance in fintech. Cares about Parquet performance.
     Has time to contribute to footers (“V3”).

Follow up items:

   - Micah’s Parquet format changes process
      - First PR merged, need to finalize Java
      - => Mostly done
   - Jira -> GitHub migration
      - Getting started with GitHub. Will follow up on the mailing list.
      - => Discussion mostly closed. Some async follow-up remains.

Agenda:

   - Finalizing [EXTERNAL] Parquet extensions
     <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
      - AI: Alkis Evlogimenos <alkis.evlogime...@databricks.com> to update the
        PR with everything in the doc except Alternatives Considered, and split
        the examples into another page.
   - New footer metadata discussion.

Discussion:

   - Extensions:
      - Add functionality to read/write the extension and show that we can
        ignore it.
         - 1: write an extension, then read the file through the old footer
           path, which ignores the extension.
         - 2: write an extension and allow reading it back.
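
The two extension checks above can be illustrated with a toy sketch. This is not Parquet's Thrift wire format; it assumes a made-up encoding (2-byte field id, 4-byte length, payload), and the field ids, `EXTENSION_FIELD_ID`, and both function names are hypothetical. It only shows the principle that a reader skipping unknown fields still sees the old footer unchanged.

```python
import struct

# Hypothetical field ids for this toy format (not taken from parquet.thrift).
SCHEMA_ID = 1
ROW_GROUPS_ID = 2
EXTENSION_FIELD_ID = 32767  # assumed id carrying the opaque extension bytes


def encode_fields(fields):
    """Serialize {field_id: payload_bytes} as id (u16) + length (u32) + payload."""
    out = bytearray()
    for field_id, payload in fields.items():
        out += struct.pack("<HI", field_id, len(payload))
        out += payload
    return bytes(out)


def decode_fields(data, known_ids):
    """Read back only the fields in known_ids and skip everything else."""
    result = {}
    offset = 0
    while offset < len(data):
        field_id, length = struct.unpack_from("<HI", data, offset)
        offset += 6  # size of the "<HI" header
        if field_id in known_ids:
            result[field_id] = data[offset:offset + length]
        offset += length  # unknown fields are skipped using their length
    return result


old_footer = {SCHEMA_ID: b"schema", ROW_GROUPS_ID: b"row groups"}
extended = dict(old_footer)
extended[EXTENSION_FIELD_ID] = b"new footer bytes"
blob = encode_fields(extended)

# Case 1: an old reader ignores the extension and sees the original footer.
old_view = decode_fields(blob, {SCHEMA_ID, ROW_GROUPS_ID})
# Case 2: a new reader can read the extension back.
new_view = decode_fields(blob, {SCHEMA_ID, ROW_GROUPS_ID, EXTENSION_FIELD_ID})
```

Here `old_view` equals `old_footer`, while `new_view` additionally contains the extension payload.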
   - New metadata:
      - Flatbuffer is bigger than Thrift: need to optimize the metadata
      - Start from a 1-1 implementation of the existing footer and keep
        iterating one commit at a time.
      - Would like to have a branch in GitHub Arrow C++ or a public fork on
        GitHub to share the prototype.
      - Add an option to parquet-tool to print the footer.
      - Add a utility to obfuscate the schema so that people can share their
        metadata without sharing proprietary information.
         - That way we can collect data about slow footers and validate that
           we can read faster with the new footer
         - => creation of a database of footers.
         - Getting a feel for which features are used by users.
      - Alkis would like to share his findings through a blog post.
      - Also need to make sure the addition of the new footer doesn’t impact
        old footers too much.
      - Possibly:
         - CodSpeed for performance testing
         - Thrift linter: https://github.com/thrift-labs/thrift-fmt
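
As a rough illustration of the schema obfuscation utility discussed above, here is a minimal sketch. It assumes the schema is available as a list of dotted column paths and that names never contain dots; `obfuscate_name`, `obfuscate_schema`, and the salt are hypothetical and not part of any Parquet tool. A real utility would also have to consider other footer contents such as column statistics and key-value metadata.

```python
import hashlib


def obfuscate_name(name, salt, length=12):
    """Replace one field name with a salted, truncated SHA-256 hex digest."""
    digest = hashlib.sha256(f"{salt}:{name}".encode("utf-8")).hexdigest()
    return "f_" + digest[:length]


def obfuscate_schema(column_paths, salt):
    """Obfuscate each component of each dotted path, keeping the nesting shape.

    The same name always maps to the same token for a given salt, so shared
    parent fields stay recognizable as shared after obfuscation.
    """
    return [
        ".".join(obfuscate_name(part, salt) for part in path.split("."))
        for path in column_paths
    ]
```

For example, `obfuscate_schema(["customer.id", "customer.address.city"], salt)` returns two paths with the same depth as the inputs and an identical first component, so the width and nesting of the schema are preserved while the names are hidden. Keeping the salt private makes it harder to guess names by hashing candidates.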
   - AI:
      - [Julien] Create a parquet-benchmark repo for a footer db and other
        things
         - Example: https://github.com/rok/parquet-benchmark
      - Alkis to pick where on GitHub to push his prototype branch
      - Follow up on:
         - https://github.com/apache/parquet-format/pull/445
