Notes Parquet Sync Aug 28th

Julien Le Dem Wed, 28 Aug 2024 10:52:38 -0700

Attendees:

   - Alkis: Databricks storage and IO. Goals: make Parquet metadata better
     for wide schemas and in general.
      - Get PR in on parquet-benchmark
      - Extensions PR in review
      - Review ongoing footer experiments.
   - Micah: Google. Listening in.
   - Rok: freelancing for fintech. Solving a problem related to encryption.
     Nothing to discuss yet. Interested in the wide schema tables. Started
     pushing to donate a wide footer to parquet-benchmark.
   - Julien: Datadog. Interested in metadata improvements.
   - Ashish: listening in.
   - Gene: Databricks, main contributor to the Variant work. Topic: Where to
     put the spec?


Agenda:

   - Ongoing Metadata tasks
   - *Review Alkis’s footer experiments*
   - *Variant type*


Notes:

   - Ongoing Metadata tasks:
   - Get PR in on parquet-benchmark:
     https://github.com/apache/parquet-benchmark/pull/1
      - Action Items:
         - Micah: last review.
         - Julien: Review and merge.
   - Extensions PR in review:
     https://github.com/apache/parquet-format/pull/254
      - Goal to end the vote by the end of the week.
      - Minimum 3 binding votes.
      - Action Items:
         - Micah: last review and vote


   - *Review Alkis’s footer experiments*:
     https://github.com/apache/arrow/pull/43793
      - Standard Google C++ benchmark:
         - Add footer
         - Convert footer
         - Verify footer
      - Measure:
         - Make sure we don’t blow up the metadata
         - Overhead of adding the new footer when not reading it.
      - Collecting telemetry in Databricks to have more information on size
        of metadata (can we use smaller ints for sizes?).
         - Can we limit the max size of a row group?
      - Hierarchical definition in metadata?
         - Move encodings out of the footer so they are stored only with
           pages.
            - General agreement on this
      - Action Items:
         - Alkis: will start a Google Doc from the benchmark to discuss the
           optimizations that are more controversial.
            - Discuss limiting row group size to int32
            - Discuss removing stats from the footer or having two layers of
              footer.
   - *Variant type.*
      - Arrow and Parquet are both good hosting projects for that. Parquet
        appears to have more consensus.
      - Logical encoding with fairly complex spec.
         - Need separate jar.
            - Consumable by ORC, Avro, …
      - Currently: Spark holds the spec and the code
         - Contribute spec first: collect comments and make changes
         - Code takes a little longer: need to refactor to separate from Spark
         - There will be more than one implementation (or even more than one
           JVM impl)
      - Parquet Cpp
         - Current Arrow impl combines IO and allocation in the library
         - Would be better to have a separate lib that does neither IO nor
           allocation.
      - Follow up:
         - Gene to start a Google Doc to form a plan and will share on the
           thread.
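
A small illustration of the int32 question raised under the footer
experiments, as a minimal sketch only. It assumes sizes are stored as
fixed-width signed integers, which the notes do not state; the helper names
and the sample row group sizes are hypothetical and not taken from the
benchmark or from parquet-format.

```python
import struct

# Largest value a signed 32-bit integer can hold: 2147483647,
# i.e. one byte short of 2 GiB when the value is a size in bytes.
INT32_MAX = 2**31 - 1


def fits_in_int32(size_in_bytes: int) -> bool:
    """Return True if a non-negative size can be stored in a signed int32."""
    return 0 <= size_in_bytes <= INT32_MAX


def fixed_width_bytes(size_in_bytes: int) -> int:
    """Bytes needed under the assumed fixed-width encoding (int32 or int64)."""
    fmt = "<i" if fits_in_int32(size_in_bytes) else "<q"
    return len(struct.pack(fmt, size_in_bytes))


# Hypothetical row group sizes: 128 MiB, 1 GiB, 3 GiB.
for size in (128 * 1024**2, 1024**3, 3 * 1024**3):
    print(size, fits_in_int32(size), fixed_width_bytes(size))
```

Under that assumption, capping a row group at int32 would halve the width of
each stored size field but would rule out row groups of 2 GiB or larger.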


Links:

   - Issue on compat testing:
     https://github.com/apache/parquet-format/issues/441


On Wed, Aug 28, 2024 at 9:00 AM Julien Le Dem <[email protected]> wrote:

> The next Parquet Sync is happening today at 9:30am PT - 12:30pm ET -
> 6:30pm CET
> (in 30min)
> To join the invite:
> https://calendar.app.google/61H58BfhTbY82tuZ6
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>
