Notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees: Apache Parquet Community Sync <[email protected]>
- Julien: Datadog, progress updates (ALP, …)
- Micah Kornfield: Databricks, just listening in
- Kenny Daniel: Hyperparam, listening in
- Neelesh Salian: Apple, listening in
- Will Edwards: Spotify, listening in
- Connor Tsui: Spiral, listening in
- Benjamin Owad: Snowflake, listening in
- Jiayi: Databricks, footer, non-contiguous pages, working group schedule
- Alkis: Databricks, non-contiguous pages, file type, footer
- Div: Databricks, non-contiguous pages, file type, footer
- Andrew Lamb: InfluxData, just listening in
- Burak: Databricks, file type
- Steve Loughran: Cloudera, Variants
- Rok Mihevc: G-Research/Arctos Alliance, vector-like datatype proposal
- Ismaël Mejía: Microsoft, Java performance optimizations
- Daniel Weeks: Databricks, footer, non-contiguous pages, file type, etc.
- Fokko Driesprong: Databricks, New Types
- Rahil Chertara: Onehouse, listening in

Agenda/Notes:
- Encodings progress:
  - ALP
    - [Prateek]
      - All comments addressed on the spec
      - Cpp same
    - Vinoo: Java impl with end-to-end test working <https://www.google.com/url?q=https://github.com/apache/parquet-java/pull/3397.&sa=D&source=editors&ust=1779321593156870&usg=AOvVaw0RbSKQiVcTHp-Cq8CNpnlt>
      - Cross-language testing J<>Cpp almost done
      - Micah: make sure coverage of Java unsigned int math is sufficient.
    - Andrew: Rust implementation is progressing: https://github.com/apache/arrow-rs/pull/9372
      - Incremental encoding/decoding for later.
    - Andrew: spec is in very good shape. Would appreciate another pair of eyes: https://github.com/apache/parquet-format/pull/557
    - TODO: move cpp to Pqt repo (pending Antoine’s review). Can be done later.
    - Goal to vote on the mailing list within the next 2 weeks.
  - FSST
    - Arnav addressing Micah’s comments
    - TODO: Performance pain point on the POC
    - Micah: Design question. Encoding values together.
    - TODO: Andrew to intro CWI to Arnav for feedback.
- Non-contiguous pages
  - Dan: doc <https://www.google.com/url?q=https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA/edit?tab%3Dt.0%23heading%3Dh.k4r8orckhbx0&sa=D&source=editors&ust=1779321593159916&usg=AOvVaw0YGOIza6jzGnD8vpl3sARG> to review
    - Dictionary may need to have extra metadata to find it.
    - Page index only has data pages
    - Might need discussion on handling the incompatible change (PAR2)
    - Broader discussion on knowing the length of things in addition to where they start (example: bloom filter).
  - TODO:
    - Dan: POC to validate feasibility
    - Everyone: please review the doc above
- Java Performance optimizations:
  - Reorganized PRs per encoding / improvement type <https://www.google.com/url?q=https://github.com/apache/parquet-java/issues/3530&sa=D&source=editors&ust=1779321593161627&usg=AOvVaw34vkbjT03G5eKMTF9WUtGS>, please help with review
  - Bypassing Hadoop compression abstractions gives ~1.5x better (de)compression. Should we?
    - It is good to reduce dependency on Hadoop APIs.
  - En(de)coding improvements for the Spark Vectorized Parquet Reader + support for BSS <https://www.google.com/url?q=https://github.com/apache/spark/issues/56011&sa=D&source=editors&ust=1779321593162585&usg=AOvVaw1J6tn2Co5RtWbuyRgB9pfH>
  - TODO:
    - Please review PRs!
  - Discussion on deprecating LZO codec
- File type
  - Burak: list thread: https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
  - General need for supporting AI use cases
  - Related:
    - LanceDB blog <https://www.google.com/url?q=https://www.lancedb.com/blog/lance-blob-v2&sa=D&source=editors&ust=1779321593163975&usg=AOvVaw3cKlHihGIIv6sqQyFMqwAQ>: union of file reference vs embedded blob.
    - Inlining is connected to the “non-contiguous pages” discussion.
      - Big blobs create asymmetry in column size, which non-contiguous pages address.
  - Discussion items:
    - Should it be in Parquet?
    - Logical type vs extension?
      - Dan: opinion: logical type
      - Russel: need for interoperable semantics.
    - Are paths relative?
    - If reference counting, probably not in Parquet
    - Russel: Cases:
      - Small inline blobs
      - Separate Column files with inline blobs
      - External Ref types
  - TODO:
    - Share a doc on the mailing list, get feedback from Antoine.
- Footer
  - Parquet footer working doc <https://www.google.com/url?q=https://docs.google.com/document/d/1eiygLyg_jcEiPnF_F6XI01dmX2DDa6UjDJH9ZAJCKe8/edit?usp%3Dsharing&sa=D&source=editors&ust=1779321593166425&usg=AOvVaw2sMX_M-bb28_HDCy6Ji2jJ>
  - Jiayi: listed options in the doc and kicked off experiments.
  - Schedule meeting through mailing list
  - TODO: feedback on the doc above
- Variant
  - Read optimizations merged in: https://github.com/apache/parquet-java/pull/3481
  - Ongoing variant hardening work: https://github.com/apache/parquet-java/pull/3562 https://github.com/apache/parquet-testing/pull/113
    - Bug: parquet-java + Spark need to handle unsorted variant objects.
  - Steve: thanks Neelesh
    - A couple of bugs are being fixed; other implementers need to make sure they handle unsorted objects. => risk here.
    - Performance is better, but there is still a lot of work to do.
  - Micah: perf of Rust impl?
    - Andrew: not yet heavily used.
  - Steve: good about validation of implem.
  - Neelesh: monthly Variants in Iceberg call, see the Iceberg calendar
- Vector-like type
  - Please read and comment on the design doc <https://www.google.com/url?q=https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab%3Dt.0&sa=D&source=editors&ust=1779321593169825&usg=AOvVaw1eSZGF4EYpCnmNObh1NO-6>, we still need to agree on an approach
  - C++ PoC <https://www.google.com/url?q=https://github.com/rok/arrow/pull/50&sa=D&source=editors&ust=1779321593170177&usg=AOvVaw3biVgni8GRFffBKM9DQGhn> on VECTOR repetition type shows 6-7x roundtrip speedup vs LIST
  - Past and related discussions [1] <https://www.google.com/url?q=https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3&sa=D&source=editors&ust=1779321593170626&usg=AOvVaw25sIzHy6iPXxAe0FJUHDOs> [2] <https://www.google.com/url?q=https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3&sa=D&source=editors&ust=1779321593170801&usg=AOvVaw19VhlMiDFpM4NHwkPm3xN7> [3] <https://www.google.com/url?q=https://lists.apache.org/thread/qxtksj5tnlhrtbxzhk3cdvrkfyq34nwg&sa=D&source=editors&ust=1779321593170973&usg=AOvVaw1l0E_-m1HpDmvFdh9wYG65>
  - Rok:
    - Need more feedback in the doc above
    - 3 options
    - Small POC for Repetition type in C++
      - Write: 7x speedup on fixed-size list
      - Read: 15x faster
  - Dan:
    - People are interested in this.
    - Will review
    - Did we discuss encodings?
      - Rok: did not look into this so far. TODO
      - Floats or integers (ALP? PFoR?)
    - What is the typical vector size? (500?)
      - Rok: 1000 or millions
      - Need to discuss what is a reasonable cutoff.
    - How do we define stats for vectors?
      - Good question!
      - Rahil: what would we use stats for? Need to define use cases for stats here.
  - Micah: started looking, sorry for the delay
  - TODO:
    - Everyone: Review doc
- Russ: Release Automation
  - https://github.com/apache/parquet-java/pull/3548

On Wed, May 20, 2026 at 9:01 AM Vinoo Ganesh <[email protected]> wrote:

> I won't be able to join, but Prateek can share some ALP updates from my
> side.
>
> Thanks,
> Vinoo Ganesh | [email protected]
>
> <[email protected]>
>
>
> On Wed, May 20, 2026 at 11:00 AM Julien Le Dem <[email protected]> wrote:
>
> > The next Parquet sync is today Wednesday May 20th at 10am PT - 1pm ET - 7pm
> > CET (in ~2h)
> >
> > To join the invite, join the group:
> > https://groups.google.com/g/apache-parquet-community-sync
> >
> > Everybody is welcome, bring your topic or just listen in.
> >
> > (Some more details on how the meeting is run:
> > https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
> >
>
