Attendees:
- Julien: Datadog, versioning.
- Neelesh Salian: Apple, listening in
- Shawn Chang: AWS, listening in
- Russell, Snowflake - File Type, Vector Type, micro row groups :|, Release Automation PR <https://github.com/apache/parquet-java/pull/3548>
- Ryan Blue (Databricks) - Coordinating forward-incompatible changes to go faster
- Divjot Arora: Databricks, non-contiguous pages, file type, int96 stats
- Vinoo Ganesh (Kepler) - parquet-java ALP
- Gunnar Morling, Confluent; update on Hardwood <https://hardwood.dev/latest/>
- Rok Mihevc: G-Research/Arctos Alliance, vector-like datatype proposal, new footer
- Connor Tsui: Spiral, listening in
- Jiayi Wang: Databricks, footer update
- Daniel Weeks: Databricks, File Type, Format Structure, Versioning
- Daniel Lee: Georgia Tech, listening in
- Stevel: Cloudera
- Kurtis Wright: AWS, listening in
- Arnav: FSST
- Alekhya Manem: Oracle, listening in

Agenda/Notes:
- [Dan Weeks] Versioning <https://docs.google.com/document/d/1zrbGT4kRCEdadBUludwfQR9b2CfLgH-RWn9zE84gYfg/edit?tab=t.0#heading=h.aozivdm2oj4d> (again)
  - Doc shared on the list to discuss Versioning
  - Goal: accelerate and reduce confusion (“Parquet V2” is an example of confusing versioning)
  - Writer side: what breaks when the reader breaks?
  - Blanket 2-year period is not reflective of reality. Some things get adopted faster. Some things linger.
  - 2 goals:
    - Clarifying levels of incompatibility (whole file, one column, optional understanding) and making sure an old reader fails instead of reading the wrong thing, but without false positives (failing when it could have read the file)
    - Simplifying communication over the compatibility matrix
      - Instead of a long list of features => “I support all V3 features”
      - Creates an incentive to implement straggler features.
- [Ryan] Coordinating forward-incompatible changes to go faster
  - New features go in “next version” bucket
  - Later marked as part of Version X.
  - [Andrew]
    - Ryan had a good way of breaking out different types of changes and maybe treating such changes differently:
      - Backwards compatible changes (old readers can still read, but maybe with a degraded experience, e.g. BloomFilters) are not as important to version
      - Backwards incompatible changes (old readers can’t read files written with new features) go out only with major releases (packages of releases)
- Encoding:
  - ALP
    - [Vinoo] Parquet-java PR ready
      - https://github.com/apache/parquet-java/pull/3397
    - [Vinoo] Parquet-testing open questions
      - https://github.com/apache/parquet-testing/pull/100
      - Should we store large files in parquet-testing (cc Andrew / Prateek)
    - Open question:
      - Do we want large files in parquet-testing?
      - Need to finalize this to get to the vote.
      - Recommendation to do LFS (github optimized)
    - Cross language testing:
      - Write java -> read C++
      - Rust almost done
    - Action: Prateek to send email to start the voting process.
  - FSST:
    - Updated perf numbers <https://docs.google.com/document/d/1Xg2b8HR19QnI3nhtQUDWZJhCLwJzW6y9tU1ziiLFZrM/edit?tab=t.0>.
    - Improvements to perf.
    - Comparison numbers against other encodings are in the doc above.
    - Reached out to authors of FSST papers and they are giving feedback.
      - Ex:
        - Using larger codes: 12- or 16-bit codes instead of 8-bit codes.
- [Burak] File Type (mailing list <https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy>) (design doc <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?tab=t.0#heading=h.k8qyue4jj4rn>)
  - Discussion Items
    - Metadata Field, etag, content_type, etc
    - There’s been back and forth:
      - Ex: Do we include file metadata?
    - Dan’s Proposal: keep primitive types as simple as possible.
    - Divjot: agreement to keep simple and have the option to add metadata in the future.
    - Russell: same: KIS
    - Ryan: +1
    - Rahil: question. Better to have single type (content vs not) or is it complex?
      - => Ryan: don’t want to deal with mixed content/pointer now.
        Add content in the future.
    - Dan: to check with the list whether anybody disagrees with the consensus.
- [Dan] Non-contiguous Pages / Logical Row Groups (mailing list <https://lists.apache.org/thread/jgq7wk3641ss27y851zdok1v2nskyvhd>) (design doc <https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA/edit?tab=t.0#heading=h.k4r8orckhbx0>)
  - Discussion Items
    - Approach #1: Preserve Row Group boundaries (draft example <https://github.com/danielcweeks/incubator-parquet-mr/pull/1>)
    - Approach #2: Purely logical (micro) row groups (draft example <https://github.com/apache/parquet-java/pull/3578>)
    - Impact on spec fields
  - Alkis proposes extending to purely logical row groups
    - Draft examples of implementation.
    - Row ranges divergent from row groups (pages can span row ranges)
    - => this becomes “micro-row groups”. Stats are per range instead of per row group.
    - More of a conceptual change than a big implementation change: actually not much work impl wise.
  - TODO: Please check out the different tabs in the doc linked above.
- [Steve] Parquet 1.18 (esp. variant)
  - What do we want in the 1.18 release?
  - Variant: opened PR on hardening reader.
    - Thanks Fokko.
  - Russell: how about we try to get the PR below into this release? Steve: yes
- [Russ] Parquet Release Automation https://github.com/apache/parquet-java/pull/3548
  - Would make it a lot easier to release
  - Release manager doesn’t need to sign. The infra does it.
- [Rok] Vector Type:
  - Design doc <https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab=t.0> discussion is winding down. How should we proceed?
    - Start working on draft implementations
    - Follow up on the dev list
  - C++ PoC <https://github.com/rok/arrow/pull/51> with VECTOR repetition type shows significant speedup vs LIST. Python wrapper on top of it shows comparable performance to numpy space-wise.
- [Divjot] New footer update
  - Modular + SoA encoding path (Parquet footer working doc <https://docs.google.com/document/d/1eiygLyg_jcEiPnF_F6XI01dmX2DDa6UjDJH9ZAJCKe8/edit?usp=sharing>)
  - Next meeting we will discuss offsetIndex modular + footer benchmark, same time next week
  - General consensus on using a modular footer
  - Thrift vs flatbuf
    - thrift appears to be efficient with custom decoders
  - Question on jump table representation
  - Next steps: draft implementation and benchmarks
  - FYI int96 stats for parquet-java (PR <https://github.com/apache/parquet-java/pull/3590>)
    - Needs some alignment on detecting invalid stats from buggy writers (mailing list <https://lists.apache.org/thread/b95w2sb99d9v5lxscrfkg1tffj1k4p2w>)
    - Ryan: have an explicit SortOrder to indicate a writer is producing valid int96 stats rather than a “createdBy” allowlist
- [Gunnar] Hardwood
  - 1.0.0.CR1 <https://www.morling.dev/blog/improved-column-reader-api-geospatial-support-hardwood-1-0-0-cr1-available/> released this week; 1.0 Final planned for next week
  - Would love for folks to take it for a spin and report back any issues

On Wed, Jun 3, 2026 at 8:44 AM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is today Wednesday Jun 3rd at 10am PT - 1pm ET -
> 7pm CET (in ~1h)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
