Notes Parquet sync 9/25/24

Julien Le Dem Wed, 25 Sep 2024 16:06:12 -0700

Attendees:

   - Micah: Google
   - Ryan: Databricks, Variant
   - Andrew: Influx
   - Gene: Databricks, Variant
   - Ashish
   - Alkis: Databricks, Metadata v3
   - Aihua: Snowflake, Variant
   - Rok: Fintech
   - Riza: Cloudera, Impala
   - Steve: Cloudera
   - Julien: Datadog, discussion around Metadata v3 + Variant


Agenda:

   - Variant type
      - Moving Variant to the Parquet Project
        <https://docs.google.com/document/d/1guEzBQjzOEEZvvibeZjNraKmZHWtxQR95O_DvtZU0xw/edit#heading=h.5ad5xy8ox6bp>
         - Overview
            - Spec in /parquet-format
            - Java impl in /parquet-java
      - Need to rapidly release changes
         - Have a build scoped to variant in parquet-java, to iterate faster
         - Disclaimer on the spec to start with (work in progress)
         - Need to define some logical types
      - Next steps:
         - Gene: Open a PR on /parquet-format with disclaimer on work in
           progress
         - Next: parquet-java implementation. TODO: figure out actual build
           delineation
         - Eventually we will vote to remove the disclaimer and make it
           official
   - Alkis: update on Metadata v3
      - New improvements since a PR was opened with benchmarks


0/amazon_polarity: num-rgs=900 num-cols=3 thrift=1049k flatbuf=230k packed=139k
1/amazon_reviews_books: num-rgs=159 num-cols=43 thrift=750k flatbuf=240k packed=160k
2/cmrc2018: num-rgs=4 num-cols=10 thrift=16k flatbuf=3.8k packed=2.6k
3/dbr-fleet-example-0: num-rgs=4 num-cols=2950 thrift=2.1M flatbuf=1035k packed=709k
4/dbr-fleet-example-1: num-rgs=1 num-cols=2987 thrift=818k flatbuf=554k packed=420k
5/everyday_conversations: num-rgs=3 num-cols=12 thrift=14k flatbuf=5.2k packed=3.1k

   - Perf improvement to thrift needs review:
     https://github.com/apache/thrift/pull/3037
   - We need committers to respond promptly to PRs in
     https://github.com/apache/parquet-benchmark/
      - Possibly Daniel Weeks, Nong, Ryan Blue can help
   - For reference:
      - https://www.influxdata.com/blog/how-good-parquet-wide-tables/
        TL;DR: in the Rust implementation at least, a 4x or greater
        improvement could be had with no format changes, just software
        engineering
      - https://www.vldb.org/pvldb/vol16/p2769-durner.pdf is a great read
        about how to size object store requests
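
To put the footer sizes from the Metadata v3 benchmarks in perspective, here is a minimal sketch that computes how many times smaller the flatbuf and packed footers are than the thrift footer for each file. The numbers are copied from the notes; the helper names (to_k, reduction_factors) are made up for this illustration, and the sketch assumes 1M = 1000k, which the notes do not state.

```python
# Footer sizes copied from the notes: (thrift, flatbuf, packed).
FOOTER_SIZES = {
    "amazon_polarity": ("1049k", "230k", "139k"),
    "amazon_reviews_books": ("750k", "240k", "160k"),
    "cmrc2018": ("16k", "3.8k", "2.6k"),
    "dbr-fleet-example-0": ("2.1M", "1035k", "709k"),
    "dbr-fleet-example-1": ("818k", "554k", "420k"),
    "everyday_conversations": ("14k", "5.2k", "3.1k"),
}


def to_k(size):
    """Convert a size such as '3.8k' or '2.1M' to a float in units of k.

    Assumption: 1M = 1000k (the notes do not say whether it is 1000 or 1024).
    """
    scale = {"k": 1.0, "M": 1000.0}[size[-1]]
    return float(size[:-1]) * scale


def reduction_factors(sizes=FOOTER_SIZES):
    """Return {dataset: (thrift/flatbuf, thrift/packed)} size ratios."""
    factors = {}
    for name, (thrift, flatbuf, packed) in sizes.items():
        t = to_k(thrift)
        factors[name] = (t / to_k(flatbuf), t / to_k(packed))
    return factors


if __name__ == "__main__":
    for name, (vs_flatbuf, vs_packed) in reduction_factors().items():
        print(f"{name}: flatbuf {vs_flatbuf:.1f}x smaller, "
              f"packed {vs_packed:.1f}x smaller than thrift")
```

Under that assumption, the packed footer comes out roughly 1.9x to 7.5x smaller than thrift across these six files, with the smallest gain on the single-row-group, 2987-column file.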



On Wed, Sep 25, 2024 at 8:00 AM Julien Le Dem <[email protected]> wrote:

> The Parquet Sync is happening today at 9:30am PT - 12:30pm ET - 6:30pm CET
> (in 90 mins)
> To join the invite:
> https://calendar.app.google/uM78Qf3YiTAaPm5g8
>
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>
