Re: Parquet sync today

Julien Le Dem Thu, 23 Jan 2025 17:15:39 -0800

Thank you for the email updates. We read them during the meeting, which was
quite useful.
Notes:


Attendees:

   - Gene - Databricks
   - Micah - Google
   - Gabor - Dremio
   - Fokko - Databricks
   - Aihua - Snowflake
   - Raul - QuantStack
   - Neil - Snowflake
   - Kenny - HyperParam: author of HyParquet
   - Julien - Datadog
   - Antoine - QuantStack
   - Russell - Snowflake
   - Rok -


Agenda:

   - Variant update: Gene/Daniel/Andrew/Gabor/Fokko/Rok/Aihua
      - (email updates from people who could not attend)
         - Daniel:
           With respect to the reference implementations for Variant, we
           had discussed the possibility of Rust or C++, but both require
           significant work. The Java and native Python implementations
           are much closer and should cover the concerns for verification
           of the spec. I still think there will be work on the Rust side,
           but I don't think there's a C++ implementation that would be in
           a state to open source. For the shredding spec, Micah, Ryan,
           Russell and I met and are closing in on wording that everyone
           is happy with, so I expect that will close out shortly.
         - Andrew:
           As a brief update, I am working on finding someone to help with
           the Rust implementation of variant. Moving forward with Java
           and Python seems reasonable to me, though I would truly love to
           get a Rust implementation to ensure there are no potential
           gotchas for a native implementation.

   - Gene update:
      - Variant binary encoding/decoding:
         - Java implementation under review in the parquet-java repo.
         - Python implementation in PySpark.
            - Pure Python, in the Spark repo.
      - Variant shredding:
         - Still working on the implementation in Spark.
         - What are the next steps?
         - TODO: follow up on the mailing list.
   - How are we releasing Variant?
      - Current plan: release Variant + shredding together.
      - Releasing Variant binary first would be a possibility.
   - Remaining questions:
      - New types added to the format
         - Nano timestamp: need clarification on the actual semantics. A
           long cannot store a nanosecond timestamp with a practical
           range. Year 9999 is often used as a special value (which does
           not fit in a long).
         - Avro, Arrow, and Iceberg have a 64-bit nano ts.
            - Limited range:
               - [1677-09-21 00:12:43.145224193,
                 2262-04-11 23:47:16.854775807]
         - Snowflake implementation: defaults to 8 bytes, expandable to
           16 bytes.
            - Variant could support a 16-byte version.
         - Step 1:
            - Support in Variant:
               - 64-bit micros ts
               - 64-bit micros ts without timezone
               - 64-bit nano ts
               - 64-bit nano ts without timezone
            - Next step:
               - Gene to follow up with Russell, Ryan, Antoine
         - Step 2:
            - Add picosecond ts to Parquet. Define how it’s mapped to
              native types.
         - Constraints:
            - Number of type codes
            - Using 20+ already (including: nano ts, time, UUID)
            - 6 bits: 64 types maximum (we might use the last one to
              extend).
         - Interval: full range of SQL types.
   - Russell:
      - Someone at Snowflake will look into the parquet-cpp implementation
        of Variant.
         - Binary variant and the shredding
   - Antoine:
      - Do we have official Variant test cases to test various
        implementations?
      - It would be nice to provide a set of test cases for cross-language
        compatibility.
   - Variant implementations:
      - Java implementation: https://github.com/apache/parquet-java/pull/3117
      - Python PR? https://github.com/apache/spark/pull/49591
      - There seem to be round-trip tests against Spark here:
        https://github.com/apache/spark/blob/54a59b7f3ceb575e478650ab8ead01922595ea17/python/pyspark/sql/tests/test_types.py#L2060



   - Wide schema performance problem: [Antoine] new footer
      - Interest in this work
      - Russell also interested.
      - Need to talk about encryption in the new footer.
         - Opportunity to improve encryption handling in the footer.
      - TODO: follow up with Alkis
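
For anyone who wants to check the 64-bit nanosecond range quoted in the
timestamp discussion above, here is a minimal sketch in plain Python. The
helper names (format_epoch_nanos, to_epoch_nanos, fits_in_int64) are made up
for this illustration and do not come from any Parquet or Variant
implementation. It treats naive datetimes as UTC and assumes a signed 64-bit
count of nanoseconds since the Unix epoch.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 1_000_000_000
EPOCH = datetime(1970, 1, 1)  # naive datetime, interpreted as UTC


def format_epoch_nanos(nanos: int) -> str:
    """Render nanoseconds since the Unix epoch as a timestamp string."""
    # divmod floors toward negative infinity, so frac is always in [0, 1e9).
    seconds, frac = divmod(nanos, NANOS_PER_SECOND)
    moment = EPOCH + timedelta(seconds=seconds)
    # isoformat omits the fractional part when microsecond == 0,
    # so the nanosecond fraction is appended by hand.
    return f"{moment.isoformat(sep=' ')}.{frac:09d}"


def to_epoch_nanos(moment: datetime) -> int:
    """Convert a naive (UTC) datetime to nanoseconds since the Unix epoch."""
    delta = moment - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * NANOS_PER_SECOND + delta.microseconds * 1_000


def fits_in_int64(value: int) -> bool:
    return INT64_MIN <= value <= INT64_MAX


print(format_epoch_nanos(INT64_MIN))
print(format_epoch_nanos(INT64_MAX))

# A "year 9999" sentinel overflows as nanoseconds but fits as microseconds.
year_9999_nanos = to_epoch_nanos(datetime(9999, 12, 31, 23, 59, 59))
print(fits_in_int64(year_9999_nanos))
print(fits_in_int64(year_9999_nanos // 1_000))
```

The lower bound listed in the notes ends in ...193, one nanosecond above the
raw int64 minimum computed here. That may reflect a library reserving the
minimum value as a sentinel, but that is my assumption.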

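
On the type code constraint ("6 bits: 64 types maximum"), a rough sketch of
why 6 bits caps the number of type codes at 64. The layout is an assumption
made for illustration: a one-byte header with a 2-bit basic type in the low
bits and the 6-bit type code in the high bits, which is my reading of the
Variant encoding spec and should be checked against it. pack_header and
unpack_header are hypothetical names, not functions from any implementation.

```python
BASIC_TYPE_BITS = 2  # assumed: low bits of the header byte
TYPE_CODE_BITS = 6   # from the notes: 6 bits for the type code
MAX_TYPE_CODES = 1 << TYPE_CODE_BITS  # 64 distinct codes, 0 through 63


def pack_header(basic_type: int, type_code: int) -> int:
    """Pack a basic type and a type code into one header byte."""
    if not 0 <= basic_type < (1 << BASIC_TYPE_BITS):
        raise ValueError(f"basic_type out of range: {basic_type}")
    if not 0 <= type_code < MAX_TYPE_CODES:
        raise ValueError(f"type_code does not fit in 6 bits: {type_code}")
    return (type_code << BASIC_TYPE_BITS) | basic_type


def unpack_header(header: int) -> tuple[int, int]:
    """Split a header byte back into (basic_type, type_code)."""
    basic_type = header & ((1 << BASIC_TYPE_BITS) - 1)
    type_code = header >> BASIC_TYPE_BITS
    return basic_type, type_code


# The highest code (63) still fits in one byte; 64 would need a seventh bit.
print(hex(pack_header(0, MAX_TYPE_CODES - 1)))
print(unpack_header(pack_header(0, 21)))
```

If the last code were kept as an escape value, as floated in the meeting,
a further byte could carry extended type ids, at the cost of one usable
code.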

On Wed, Jan 22, 2025 at 9:05 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I also unfortunately will not be able to make it today.
>
> As a brief update, I am working on finding someone to help with the Rust
> implementation of variant. Moving forward with Java and Python seems
> reasonable to me, though I would truly love to get a Rust implementation to
> ensure there is no potential gotcha's for a native implementation
>
> Thanks,
> Andrew
>
> On Wed, Jan 22, 2025 at 11:41 AM Daniel Weeks <dwe...@apache.org> wrote:
>
> > Hey Julien,
> >
> > I'm not going to be able to attend today's meeting, but just wanted to
> > follow up on a few of the items from the last meeting.
> >
> > With respect to the reference implementations for Variant, we had
> > discussed the possibility of Rust or C++, but those both have significant
> > work.  The Java and native Python implementations are much closer and
> > should cover the concerns for verification of the spec.  I still think
> > there will be work on the Rust side, but I don't think there's a C++
> > implementation that would be in a state to open source.
> >
> > For the shredding spec, Micah, Ryan, Russel and I met and are closing in
> > on wording that everyone is happy with, so I expect that will close out
> > shortly.
> >
> > -Dan
> >
> > On Wed, Jan 22, 2025 at 7:41 AM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > The next Parquet sync is today Jan 22nd at 9:30am PT - 12:30pm ET -
> > > 6:30pm CET
> > > (in about 2hs)
> > > To join the invite:
> > > https://calendar.app.google/xXGgYU6evBArpzdZ9
> > > Please contact me to be added to the recurring invite.
> > > Everybody is welcome, bring your topic or just listen in.
> > > Best
> > > Julien
> > >
> >
>
