Jan 8th 2025

Attendees:

   -

   Julien: Datadog, interested in updates
   -

   Micah: Google
   -

   Andrew L: InfluxData, lurking
   -

   Antoine: QuantStack, curious about Parquet 3 updates, Parquet C++ updates
   -

   Russell: Snowflake, Listen to Micah talk to me about shredding, Geometry
   -

   Alkis: Databricks
   -

   Ashish: Sumo Logic, listen in
   -

   Dewey: Wherobots, update on Geometry data type
   -

   Daniel: Databricks, Variant shredding, Geometry
   -

   Rok: listening in, footer
   -

   Andrew B: point cloud in Parquet


Agenda:

   -

   Parquet C++, quick update => Antoine
   -

   New footer: quick update from Alkis
   -

   Variant Shredding
   -

   Geometry types => Dewey
   -

      Iceberg Geometry/Geography type PR:
      https://github.com/apache/iceberg/pull/10981
      -

      Parquet Geometry type PR:
      https://github.com/apache/parquet-format/pull/240


Notes:

   -

   Parquet C++, quick update => Antoine
   -

      Gang Wu implemented new statistics in C++ (slight overhead, so off
      by default).
      -

         https://github.com/apache/arrow/pull/40594
         -

      The goal is to improve performance so that they can be enabled by
      default.
      -

         https://github.com/apache/arrow/pull/45202
         -

   Extension types:
   -

      Would like to revive the extension types proposal in the future
      (interest from Micah and Antoine)
      -

   New footer update:
   -

      Some metadata has been removed in the prototype (for compactness),
      but some readers need that metadata.
      -

         Example: the prototype doesn’t include the converted types.
         -

         early version of experimental flatbuf footer:
         https://github.com/apache/arrow/pull/43793
         -

   Variant Shredding:
   -

      Outstanding issues:
      -

         Micah: Specification is somewhat arbitrary on how to handle
         invalid Parquet files (whether shredded or unshredded columns take
         precedence)
         -

            Doesn’t really belong in the spec.
            -

            Reader behavior should be unspecified if the writer generates
            an invalid file.
            -

            Whether you read the shredded or unshredded data depends on the
            query. So saying one takes precedence is not really feasible in a
            performant way.
            -

         Russell agrees with leaving it as undefined.
         -

         Julien agrees.
         -

         Shredding happens in a single pass: the format allows shredding a
         column whose values are not all the same type, so the writer does
         not need to backtrack.
         -

         In meeting:
         -

            Consensus that leaving it undefined is better.
            -

            We should error out in invalid cases as much as we can.
            -

            The spec should be very clear on invalid things that should not
            be allowed.
            -

         TODO: Daniel to follow up with Ryan, to wrap this up.
         -

      What other implementations of Variant do we need to finalize?
      -

         Parquet Java
         -

         At least one other language: C++ (arrow/cpp), Rust (arrow-rs), Go…?
         -

            Discussion in Arrow to have the Variant extension type?
            -

               https://github.com/apache/arrow/issues/42069
               -

            Separate nascent effort of Iceberg C++
            -

            Ticket tracking adding variant in Rust (arrow-rs):
            https://github.com/apache/arrow-rs/issues/6736
            -

         Enabler:
         -

            Produce data files to enable cross-compatibility tests.
            -

         TODO(Daniel W): follow up with Fokko on leading a Rust
         implementation.
         -

            Iceberg Rust uses the Parquet implementation from arrow-rs:
            https://github.com/apache/iceberg-rust/blob/6e07faacd7734886718ce544e40599eb2ce939e3/Cargo.toml#L79
            -

         TODO(Daniel W): explore the feasibility of a C++ implementation of
         unshredded Variant.
         -

         TODO(Daniel W): follow up with Ryan Blue on a plan for the
         non-Java (arrow/cpp or arrow-rs) implementation, and follow up
         on the mailing list.
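
Illustration of the "error out in invalid cases" point from the Variant
Shredding discussion above: a minimal sketch of a reader-side check that
rejects a contradictory shredded/unshredded pair instead of picking a
winner. The column names value and typed_value, and the rule that only an
object may populate both, are assumptions based on the draft shredding
spec and are not recorded in these notes. The names classify,
ShreddedState and InvalidShreddingError are hypothetical.

```python
from enum import Enum
from typing import Optional


class ShreddedState(Enum):
    MISSING = "missing"
    UNSHREDDED = "unshredded"
    SHREDDED = "shredded"
    PARTIAL_OBJECT = "partially shredded object"


class InvalidShreddingError(ValueError):
    """Raised when the shredded and unshredded columns contradict each other."""


def classify(value: Optional[bytes], typed_value: Optional[object]) -> ShreddedState:
    """Classify one row of a shredded Variant column.

    value: stand-in for the variant-encoded (unshredded) binary column.
    typed_value: stand-in for the shredded column; a dict models an object.
    Both parameter names are assumptions, see the lead-in.
    """
    if value is None and typed_value is None:
        return ShreddedState.MISSING
    if typed_value is None:
        return ShreddedState.UNSHREDDED
    if value is None:
        return ShreddedState.SHREDDED
    # Both columns are populated. This sketch accepts that only when the
    # shredded side is an object (the unshredded side would then carry the
    # remaining fields). It does not decode the binary to confirm that.
    if isinstance(typed_value, dict):
        return ShreddedState.PARTIAL_OBJECT
    # Otherwise the writer produced a contradictory row: fail loudly
    # rather than letting one column silently take precedence.
    raise InvalidShreddingError(
        "both value and typed_value are set for a non-object"
    )
```

Raising here, rather than defining a precedence rule, keeps the reader's
result independent of which column a given query happens to scan.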


On Wed, Jan 8, 2025 at 7:38 AM Julien Le Dem <jul...@apache.org> wrote:

> The next Parquet sync is today Jan 8th at 9:30am PT - 12:30pm ET - 6:30pm
> CET
> To join the invite:
> https://calendar.app.google/uTqCRtdDFMAGttwY8
> Please contact me to be added to the recurring invite.
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>
