Attendees:
- Micah: Google
- Ryan: Databricks, Variant
- Andrew: Influx
- Gene: Databricks, Variant
- Ashish
- Alkis: Databricks, Metadata v3
- Aihua: Snowflake, Variant
- Rok: Fintech
- Riza: Cloudera, Impala
- Steve: Cloudera
- Julien: Datadog, discussion around Metadata v3 + Variant

Agenda:
- Variant type
  - Moving Variant to the Parquet Project
    <https://docs.google.com/document/d/1guEzBQjzOEEZvvibeZjNraKmZHWtxQR95O_DvtZU0xw/edit#heading=h.5ad5xy8ox6bp>
  - Overview
    - Spec in /parquet-format
    - Java impl in /parquet-java
    - Need to rapidly release changes
    - Have a build scoped to variant in parquet-java, to iterate faster
    - Disclaimer on the spec to start with (work in progress)
    - Need to define some logical types
  - Next steps:
    - Gene: Open a PR on /parquet-format with disclaimer on work in progress
    - Next: parquet-java implementation. TODO: figure out actual build delineation
    - Eventually we will vote to remove the disclaimer and make it official
- Alkis: update on Metadata v3
  - New improvements since a PR was opened, with benchmarks:
      0/amazon_polarity: num-rgs=900 num-cols=3 thrift=1049k flatbuf=230k packed=139k
      1/amazon_reviews_books: num-rgs=159 num-cols=43 thrift=750k flatbuf=240k packed=160k
      2/cmrc2018: num-rgs=4 num-cols=10 thrift=16k flatbuf=3.8k packed=2.6k
      3/dbr-fleet-example-0: num-rgs=4 num-cols=2950 thrift=2.1M flatbuf=1035k packed=709k
      4/dbr-fleet-example-1: num-rgs=1 num-cols=2987 thrift=818k flatbuf=554k packed=420k
      5/everyday_conversations: num-rgs=3 num-cols=12 thrift=14k flatbuf=5.2k packed=3.1k
  - Perf improvement to thrift needs review: https://github.com/apache/thrift/pull/3037
  - We need committers to respond promptly to PRs in https://github.com/apache/parquet-benchmark/
    - Possibly Daniel Weeks, Nong, Ryan Blue can help
  - For reference:
    - https://www.influxdata.com/blog/how-good-parquet-wide-tables/
      TLDR: in the Rust implementation, at least, a 4x or better improvement
      could be had with no format changes, just software engineering
    - https://www.vldb.org/pvldb/vol16/p2769-durner.pdf is a great read about
      how to size object store requests

On Wed, Sep 25, 2024 at 8:00 AM Julien Le Dem <jul...@apache.org> wrote:

> The Parquet Sync is happening today at 9:30am PT - 12:30pm ET - 6:30pm CET
> (in 90 mins)
> To join the invite:
> https://calendar.app.google/uM78Qf3YiTAaPm5g8
>
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien