Attendees:
-
Alkis: Databricks, storage and IO. Goals: improve Parquet metadata, both
for wide schemas and in general.
-
Get the PR merged into parquet-benchmark
-
Extensions PR in review
-
Review ongoing footer experiments.
-
Micah: Google. Listening in.
-
Rok: freelancing for fintech. Working on a problem related to encryption;
nothing to discuss yet. Interested in wide-schema tables. Has started
pushing to donate a wide footer to parquet-benchmark.
-
Julien: Datadog. Interested in metadata improvements.
-
Ashish: listening in.
-
Gene: Databricks, main contributor to the Variant work. Topic: Where to
put the spec?
Agenda:
-
Ongoing Metadata tasks
-
*Review Alkis’s footer experiments*
-
*Variant type*
Notes:
-
Ongoing Metadata tasks:
-
Get the PR merged into parquet-benchmark:
https://github.com/apache/parquet-benchmark/pull/1
-
Action Items:
-
Micah: last review.
-
Julien: Review and merge.
-
Extensions PR in review:
https://github.com/apache/parquet-format/pull/254
-
Goal: close the vote by the end of the week.
-
A minimum of 3 binding votes is required.
-
Action Items:
-
Micah: last review and vote.
-
*Review Alkis’s footer experiments*:
https://github.com/apache/arrow/pull/43793
-
Standard Google C++ benchmark:
-
Add footer
-
Convert footer
-
Verify footer
-
Measure:
-
Make sure we don’t blow up the size of the metadata.
-
Overhead of the added new footer for readers that do not read it.
-
Collecting telemetry at Databricks to learn more about the size of
metadata (can we use smaller ints for sizes?).
-
Can we limit the max size of a row group?
-
Hierarchical definition in metadata?
-
Move encodings out of the footer so they are stored only with the pages.
-
General agreement on this.
-
Action Items:
-
Alkis: will start a Google Doc based on the benchmark to discuss the
more controversial optimizations.
-
Discuss limiting row group size to int32.
-
Discuss removing stats from the footer or having two layers of
footer.
-
*Variant type.*
-
Arrow and Parquet are both good candidate host projects for this. Parquet
appears to have more consensus.
-
It is a logical encoding with a fairly complex spec.
-
Needs a separate jar.
-
Consumable by ORC, Avro, …
-
Currently: Spark holds the spec and the code.
-
Contribute the spec first: collect comments and make changes.
-
Code will take a little longer: it needs refactoring to separate it from Spark.
-
There will be more than one implementation (or even more than one
JVM implementation).
-
Parquet C++
-
The current Arrow implementation combines IO and allocation in the library.
-
It would be better to have a separate library that does neither IO nor
allocation.
-
Follow up:
-
Gene will start a Google Doc to form a plan and share it on the
thread.
Links:
-
Issue on compat testing:
https://github.com/apache/parquet-format/issues/441
-
On Wed, Aug 28, 2024 at 9:00 AM Julien Le Dem <[email protected]> wrote:
> The next Parquet Sync is happening today at 9:30am PT - 12:30pm ET -
> 6:30pm CET
> (in 30min)
> To join the invite:
> https://calendar.app.google/61H58BfhTbY82tuZ6
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>