Jan 8th 2025
Attendees:
- Julien: Datadog, interested in updates
- Micah: Google
- Andrew L: InfluxData, lurking
- Antoine: QuantStack, curious about Parquet 3 updates, Parquet C++ updates
- Russell: Snowflake, Listen to Micah talk to me about shredding, Geometry
- Alkis: Databricks
- Ashish: Sumo Logic, listen in
- Dewey: Wherobots, update on Geometry data type
- Daniel: Databricks, Variant shredding, Geometry
- Rok: listening in, footer
- Andrew B: point cloud in Parquet

Agenda:
- Parquet C++, quick update => Antoine
- New footer: quick update from Alkis
- Variant Shredding
- Geometry types => Dewey
  - Iceberg Geometry/Geography type PR: https://github.com/apache/iceberg/pull/10981
  - Parquet Geometry type PR: https://github.com/apache/parquet-format/pull/240

Notes:
- Parquet C++, quick update => Antoine
  - Gang Wu implemented new statistics in C++ (slight overhead: off by default).
    - https://github.com/apache/arrow/pull/40594
    - Goal to improve performance so that it can be enabled by default.
      - https://github.com/apache/arrow/pull/45202
  - Extension types:
    - Would like to revive the extension types proposal in the future (interest from Micah and Antoine)
- New footer update:
  - Some metadata has been removed in the prototype (for compactness). Some readers need that.
    - Ex: Doesn’t have the converted types.
  - Early version of experimental flatbuf footer: https://github.com/apache/arrow/pull/43793
- Variant Shredding:
  - Outstanding issues:
    - Micah: Specification is somewhat arbitrary on how to handle invalid Parquet files (whether shredded or unshredded columns take precedence)
      - Doesn’t really belong in the spec.
      - Reader behavior should be unspecified if the writer generates an invalid file.
      - Whether you read the shredded or unshredded data depends on the query. So saying one takes precedence is not really feasible in a performant way.
      - Russell agrees with leaving it as undefined.
      - Julien agrees.
      - Shredding happens in a single pass. It allows shredding a column that is not all the same type so that we don’t need to backtrack.
    - In meeting:
      - Consensus that leaving it undefined is better.
      - We should error out in invalid cases as much as we can.
      - The spec should be very clear on invalid things that should not be allowed.
      - TODO: Daniel to follow up with Ryan, to wrap this up.
  - What other implementations of Variant do we need to finalize?
    - Parquet Java
    - Another different language: C++ (arrow/cpp), Rust (arrow-rs), Go…?
      - Discussion in Arrow to have the Variant extension type?
        - https://github.com/apache/arrow/issues/42069
      - Separate nascent effort of Iceberg C++
      - Ticket tracking adding Variant in Rust (arrow-rs): https://github.com/apache/arrow-rs/issues/6736
    - Enabler:
      - Produce data files to enable cross-compatibility tests.
    - TODO(Daniel W): follow up with Fokko on leading a Rust implementation.
      - Iceberg Rust uses the Parquet implementation from arrow-rs: https://github.com/apache/iceberg-rust/blob/6e07faacd7734886718ce544e40599eb2ce939e3/Cargo.toml#L79
    - TODO(Daniel W): explore the feasibility of a C++ implementation of unshredded Variant.
    - TODO(Daniel W): follow up with Ryan Blue for a plan for the non-Java (arrow/cpp or arrow-rs) implementation and follow up on the mailing list

On Wed, Jan 8, 2025 at 7:38 AM Julien Le Dem <jul...@apache.org> wrote:

> The next Parquet sync is today Jan 8th at 9:30am PT - 12:30pm ET - 6:30pm
> CET
> To join the invite:
> https://calendar.app.google/uTqCRtdDFMAGttwY8
> Please contact me to be added to the recurring invite.
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien