Meeting notes:

Attendees:


Rok: contributor to Arrow, encryption, Rust

Gabor: Dremio, topic: Variant.

Fokko: Databricks

Dan: Databricks, topic: Variant Geo types

Kenny: hyparquet (js)

Gene: Databricks, topic: Variant

Andrew: InfluxData, Rust Parquet maintainer, DataFusion. topic: Variant in Rust

Ashish: Sumo Logic, listen in

Micah: Google

Neil: Snowflake, variant C++

Ryan: Databricks, topic: variant, geo

Aihua: Snowflake, topic: variant

Dewey: topic: open Geometry PR (C++, Rust)

Nong: Databricks

Agenda/Notes:

   - Geo types:
      - Geo implementations:
         - C++: https://github.com/apache/arrow/pull/45459
         - Java: https://github.com/apache/parquet-java/pull/2971
      - Update:
         - Geometry
         - Geography: stats TBD
         - Java:
            - Christian and Fend have been working on the Java implementation
            - Need a release
            - Fuzz testing
            - Getting a lot of feedback. Thanks!
            - Definition of the stats: in Thrift, with clear language
               - Enable bounding boxes that go over the 0 line (Fiji)
               - Don’t want stats that lie. Bad stats, bad data.
   - Variant
      - Rust impl: https://github.com/apache/arrow-rs/issues/6736
      - Need: unblock the Variant annotation in the Java library
      - Finalize outstanding discussions
         - Versioning in Variant annotation => Action item
         - What’s remaining to finalize the spec:
            - C++ and Java implementations
               - Java impl in Iceberg, moving to Parquet
               - Impls:
                  - 2 working Java implementations
                  - Spark Java implementation (binary, shredding):
                    <https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant>
                  - Spark Python implementation (binary):
                    <https://github.com/apache/spark/blob/master/python/pyspark/sql/variant_utils.py>
                  - parquet-java implementation PR (binary):
                    <https://github.com/apache/parquet-java/pull/3117>
                  - C++ impl: <https://github.com/apache/arrow/pull/45375>
                  - 2 private ones (Snowflake, Databricks (C++, binary,
                    shredding))
            - Lower priority: how to shred?
               - You cannot add columns after you instantiate the writer.
               - Could extend the writer, but that collides with encryption
               - Adding columns to the Parquet schema in the middle of writing
                 invalidates encryption
            - Shredding released at the same time as the binary Variant
               - Dangerous to do shredding as a follow-up
         - Tiny PR for the spec: GH-486: Variant object shredding without
           field shredding <https://github.com/apache/parquet-format/pull/487>
      - Compatibility across implementations => Action item
      - Goal:
         - Combined Variant and shredding release
            - Do we require support for shredding?
            - Variant with shredding is not a separate type.
            - Did we agree to roll them out together?
               - We agree that we want to roll them out together to reduce
                 potential inconsistencies in implementations. => Action item
         - Requirements for considering it ready to release:
            - Need example Parquet data.
         - Versioning of the Variant spec
            - https://github.com/apache/parquet-format/pull/474
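
Illustration for the geo stats item on bounding boxes (Fiji): a minimal Python sketch of how a reader could test a longitude against a bounding box that wraps around. It assumes that "the 0 line" in the notes refers to the antimeridian, which Fiji straddles, and that a box with xmin > xmax signals wraparound. Neither was spelled out in the meeting, and `lon_in_bbox` is a hypothetical helper, not an API of any Parquet library.

```python
def lon_in_bbox(lon, xmin, xmax):
    """Return True if longitude `lon` (degrees) lies inside the box.

    Assumption (not stated in the notes): xmin > xmax marks a box that
    wraps across the antimeridian (180 degrees longitude).
    """
    if xmin <= xmax:
        # Ordinary box: a single contiguous interval.
        return xmin <= lon <= xmax
    # Wraparound box: covers [xmin, 180] together with [-180, xmax].
    return lon >= xmin or lon <= xmax


# Hypothetical box around Fiji, spanning 177E to 178W.
fiji_xmin, fiji_xmax = 177.0, -178.0
print(lon_in_bbox(179.5, fiji_xmin, fiji_xmax))   # True
print(lon_in_bbox(-179.0, fiji_xmin, fiji_xmax))  # True
print(lon_in_bbox(0.0, fiji_xmin, fiji_xmax))     # False
```

Without the wraparound branch, a writer would have to widen such a box to the full [-180, 180] range to stay truthful, which makes the stats useless for pruning.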



Action items

   - [ ] Julien, Ryan, Micah, Aihua: Follow up on the email thread about the
     parquet-format type annotation for shredding, and how to make it easy
     to work on implementations without fuzzy communication on releases
   - [ ] Andrew: Follow up on the cross-implementation testing
   - [ ] Micah, Ryan, Dan: Finalize the type annotation versioning discussion
     on PR 474
   - [ ] Ryan: Email about the decision to release shredding with Variant


On Tue, Mar 4, 2025 at 6:13 PM Julien Le Dem <jul...@apache.org> wrote:

> The next Parquet sync is tomorrow Mar 5th at 9:30am PT - 12:30pm ET -
> 6:30pm CET
> To join the invite:
> https://calendar.app.google/WTQgodyxSmBUimXT8
> Please contact me to be added to the recurring invite. (every two weeks)
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>
