Re: Parquet sync today Wednesday May 20th

Julien Le Dem Wed, 20 May 2026 16:03:06 -0700

Notes:
https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub


Attendees: Apache Parquet Community Sync
<[email protected]>

   - Julien: Datadog, progress updates (ALP, …)
   - Micah Kornfield: Databricks, just listening in
   - Kenny Daniel, Hyperparam, listening in
   - Neelesh Salian: Apple, listening in
   - Will Edwards: Spotify, listening in
   - Connor Tsui: Spiral, listening in
   - Benjamin Owad: Snowflake, listening in
   - Jiayi: Databricks, footer, non-contiguous pages, working group schedule
   - Alkis: Databricks, non-contiguous pages, file type, footer
   - Div: Databricks, non-contiguous pages, file type, footer
   - Andrew Lamb: InfluxData, just listening in
   - Burak: Databricks, file type
   - Steve Loughran, Cloudera, Variants
   - Rok Mihevc: G-Research/Arctos Alliance, vector-like datatype proposal
   - Ismaël Mejía, Microsoft, Java performance optimizations
   - Daniel Weeks: Databricks, footer, non-contiguous pages, file type, etc.
   - Fokko Driesprong: Databricks: New Types
   - Rahil Chertara: Onehouse, listening in

Agenda/Notes:

   - Encodings progress:


   - ALP


   - [Prateek]


   - all comments addressed on the spec
   - Cpp same
   - Vinoo: Java impl with end-to-end test working:
   https://github.com/apache/parquet-java/pull/3397


   - Cross language testing J<>Cpp almost done


   - Micah: make sure coverage on java unsigned int math is enough.
   - Andrew: Rust implementation is progressing:
   https://github.com/apache/arrow-rs/pull/9372

   - Incremental encoding/decoding for later.
   - Andrew: spec is in very good shape. Would appreciate another pair of
   eyes: https://github.com/apache/parquet-format/pull/557

   - TODO: move cpp to the Parquet repo (pending Antoine’s review). Can be
   done later.


   - Goal to vote on the mailing list within the next 2 weeks.


   - FSST


   - Arnav addressing Micah’s comments
   - TODO: Painpoint of performance on POC
   - Micah: Design question. Encoding values together.
   - TODO: Andrew to intro CWI to arnav for feedback.


   - Non-contiguous pages


   - Dan: doc to review:
   https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA/edit?tab=t.0#heading=h.k4r8orckhbx0
   - Dictionary may need to have extra metadata to find it.
   - Page index only has data pages
   - Might need discussion on handling the incompatible change (PAR2)
   - Broader discussion on knowing the length of things in addition to
   where they start (example: bloom filter).
   - TODO:


   - Dan: POC to validate feasibility
   - Everyone: please review the doc above


   - Java Performance optimizations:


   - Reorganized PRs per encoding / improvement type:
   https://github.com/apache/parquet-java/issues/3530
   Please help with review.
   - Bypassing Hadoop compression abstractions = ~1.5x better
   (de)compression. Should we?


   - It is good to reduce dependency on Hadoop APIs.


   - En(de)coding improvements for the Spark Vectorized Parquet Reader +
   support for BSS:
   https://github.com/apache/spark/issues/56011
   - TODO:


   - Please review PRs!
   - Discussion on deprecating LZO codec


   - File type


   - Burak: list thread:
   https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
   - General need for supporting AI use cases
   - Related:


   - lancedb blog:
   https://www.lancedb.com/blog/lance-blob-v2
   Union of file reference vs embedded blob.
   - Inlining is connected to the “non-contiguous pages” discussion.


   - Because big blobs create asymmetry in column size that gets addressed
   by non-contiguous pages.


   - Discussion items:


   - Should it be in Parquet?
   - Logical type vs extension?


   - Dan: opinion: logical type
   - Russel: need for interoperable semantics.


   - Are paths relative?
   - If reference counting, probably not in parquet
   - Russel: Cases:


   - Small inline blobs
   - Separate Column files with inline blobs
   - External Ref types


   - TODO:


   - Share a doc on the mailing list, get feedback from Antoine.


   - Footer


   - Parquet footer working doc:
   https://docs.google.com/document/d/1eiygLyg_jcEiPnF_F6XI01dmX2DDa6UjDJH9ZAJCKe8/edit?usp=sharing
   - Jiayi: listed options in the docs and kicked off experiments.


   - Schedule meeting through mailing list


   - TODO: feedback on the doc above


   - Variant


   - Read optimizations merged in:
   https://github.com/apache/parquet-java/pull/3481
   - On-going hardening variant work:
   https://github.com/apache/parquet-java/pull/3562
   https://github.com/apache/parquet-testing/pull/113
   - bug: parquet java + spark need to handle unsorted variant objects.
   - Steve: thanks Neelesh


   - Couple of bugs are being fixed, other implementers need to make sure
   they handle unsorted. => risk here.
   - Performance better but still a lot of work to do.


   - Micah: perf of rust impl?


   - Andrew: not yet heavily used.
   - Steve: good for validation of the implementation.


   - Neelesh: monthly Variants in Iceberg call; see the Iceberg calendar


   - Vector-like type


   - Please read and comment on the design doc:
   https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab=t.0
   We still need to agree on an approach.
   - C++ PoC on VECTOR repetition type shows 6-7x roundtrip speedup vs LIST:
   https://github.com/rok/arrow/pull/50
   - Past and related discussions:
   [1] https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3
   [2] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
   [3] https://lists.apache.org/thread/qxtksj5tnlhrtbxzhk3cdvrkfyq34nwg
   - Rok:


   - Need more feedback in the doc above
   - 3 options
   - Small POC for Repetition type in c++


   - Write: 7x speedup on fixed-size list
   - Read: 15x faster


   - Dan:


   - People are interested in this.
   - Will review
   - Did we discuss encodings?


   - Rok: did not look into this so far. TODO
   - Floats or integers (ALP? PFoR?)


   - What typical vector size? (500?)


   - Rok: 1000 or millions
   - Need to discuss what is a reasonable cutoff.


   - How do we define stats for vectors?


   - Good question!
   - Rahil: what would we use stats for? Need to define use cases for stats
   here.


   - Micah: started looking, sorry for the delay
   - TODO:


   - Everyone: Review doc


   - Russ: Release Automation -
   https://github.com/apache/parquet-java/pull/3548


On Wed, May 20, 2026 at 9:01 AM Vinoo Ganesh <[email protected]> wrote:

> I won't be able to join, but Prateek can share some ALP updates from my
> side.
>
> Thanks,
> Vinoo Ganesh | [email protected]
>
> <[email protected]>
>
>
> On Wed, May 20, 2026 at 11:00 AM Julien Le Dem <[email protected]> wrote:
>
> > The next Parquet sync is today Wednesday May 20th at 10am PT - 1pm ET -
> > 7pm CET (in ~2h)
> >
> > To join the invite, join the group:
> > https://groups.google.com/g/apache-parquet-community-sync
> >
> > Everybody is welcome, bring your topic or just listen in.
> >
> > (Some more details on how the meeting is run:
> > https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
> >
>
