Re: sync today Wednesday Jun 3rd

Julien Le Dem Wed, 03 Jun 2026 14:54:00 -0700

Attendees:

   -


   Julien: Datadog, versioning.
   -

   Neelesh Salian: Apple, listening in
   -

   Shawn Chang: AWS, listening in
   -

   Russell, Snowflake - File Type, Vector Type, micro row groups :|, Release
   Automation PR <https://github.com/apache/parquet-java/pull/3548>
   -

   Ryan Blue (Databricks) - Coordinating forward-incompatible changes to go
   faster
   -

   Divjot Arora: Databricks, non-contiguous pages, file type, int96 stats
   -

   Vinoo Ganesh (Kepler) - parquet-java ALP
   -

   Gunnar Morling, Confluent; update on Hardwood
   <https://hardwood.dev/latest/>
   -

   Rok Mihevc: G-Research/Arctos Alliance, vector-like datatype proposal,
   new footer
   -

   Connor Tsui: Spiral, listening in
   -

   Jiayi Wang: Databricks, footer update
   -

   Daniel Weeks: Databricks, File Type, Format Structure, Versioning
   -

   Daniel Lee: Georgia Tech: Listening In
   -

   Stevel: Cloudera
   -

   Kurtis Wright: AWS, listening in
   -

   Arnav: FSST
   -

   Alekhya Manem: Oracle, listening in


Agenda/Notes:


   -

   [Dan Weeks] Versioning
   <https://docs.google.com/document/d/1zrbGT4kRCEdadBUludwfQR9b2CfLgH-RWn9zE84gYfg/edit?tab=t.0#heading=h.aozivdm2oj4d>
   (again)
   -

      Doc shared on the list to discuss Versioning
      -

         Goal: accelerate and reduce confusion (“Parquet V2” is an example
         of confusing versioning)
         -

         Writer side: what breaks when the reader breaks?
         -

         Blanket 2-year period is not reflective of reality. Some things
         get adopted faster. Some things linger.
         -

         2 goals:
         -

            Clarifying levels of incompatibility (whole file, one column,
            optional understanding) and making sure an old reader will
            fail instead of reading the wrong thing, but without false
            positives (failing when it could have read)
            -

            Simplifying communication over the compatibility matrix
            -

               Instead of a long list of features => “I support all V3
               features”
               -

               Creates an incentive to implement straggler features.
               -

      [Ryan] Coordinating forward-incompatible changes to go faster
      -

         New features go in “next version” bucket
         -

         Later marked as part of Version X.
         -

      [Andrew]
      -

         Ryan had a good way of breaking out different types of changes and
         maybe treating such changes differently:
         -

            Backwards compatible changes (old readers can still read, but
            maybe degraded experience – e.g. BloomFilters) not as
            important to version
            -

            Backwards incompatible changes (old readers can’t read files
            written with new features) go out only with major releases
            (packages of releases)
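
The incompatibility levels noted above (whole file, one column, optional understanding) could be sketched roughly as follows. This is only an illustration: `Scope`, `Feature`, and `check_read` are hypothetical names, not part of the Parquet spec or any implementation, and the feature names in the usage note are made up. The point is that a reader fails only when an unsupported feature actually affects what it was asked to read.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    FILE = "file"          # nothing in the file is readable without the feature
    COLUMN = "column"      # only the columns using the feature are unreadable
    OPTIONAL = "optional"  # safe to ignore (e.g. an extra filter or index)


@dataclass(frozen=True)
class Feature:
    name: str
    scope: Scope
    columns: frozenset = frozenset()  # only meaningful for Scope.COLUMN


def check_read(features, supported, requested_columns):
    """Return None if the read can proceed, else a string explaining why not."""
    for f in features:
        if f.name in supported or f.scope is Scope.OPTIONAL:
            continue
        if f.scope is Scope.FILE:
            return f"unsupported file-level feature: {f.name}"
        # Column-scoped: fail only if the read touches an affected column.
        blocked = f.columns & set(requested_columns)
        if blocked:
            return f"unsupported feature {f.name} on columns {sorted(blocked)}"
    return None
```

With a column-scoped feature on column `b`, an old reader asked only for column `a` proceeds (no false positive), while a request that includes `b` fails instead of returning wrong data.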
            -

   Encoding:
   -

      ALP
      -

         [Vinoo] Parquet-java PR ready -
         https://github.com/apache/parquet-java/pull/3397
         -

         [Vinoo] Parquet-testing open questions -
         https://github.com/apache/parquet-testing/pull/100
         -

            Should we store large files in parquet-testing? (cc Andrew /
            Prateek)
            -

         Open question:
         -

            Do we want large files in parquet-testing?
            -

               Need to finalize this to get to the vote.
               -

               Recommendation to use Git LFS (GitHub optimized)
               -

         Cross language testing:
         -

            Write Java -> read C++
            -

            Rust almost done
            -

         Action: Prateek to send email to start the voting process.
         -

      FSST:
      -

         Updated perf numbers
         <https://docs.google.com/document/d/1Xg2b8HR19QnI3nhtQUDWZJhCLwJzW6y9tU1ziiLFZrM/edit?tab=t.0>.
         -

         Improvements to perf.
         -

         Numbers comparing it to other encodings are in the doc above.
         -

         Reached out to authors of FSST papers and they are giving feedback.
         -

            Ex:
            -

               Using larger codes: 12- or 16-bit codes instead of 8-bit codes.
               -

   [Burak] File Type (mailing list
   <https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy>) (design
   doc
   <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?tab=t.0#heading=h.k8qyue4jj4rn>)
   -

      Discussion Items
      -

         Metadata Field, etag, content_type, etc
         -

      There’s been back and forth:
      -

         Ex: Do we include file metadata?
         -

         Dan’s Proposal: keep primitive types as simple as possible.
         -

         Divjot: agreement to keep simple and have the option to add
         metadata in the future.
         -

         Russell: same: KIS
         -

         Ryan: +1
         -

         Rahil: question: is it better to have a single type (content vs.
         not), or is that too complex?
         -

            => Ryan: don’t want to deal with mixed content/pointer now. Add
            content in the future.
            -

         Dan: to check with the list whether anybody disagrees with the
         consensus.
         -

   [Dan] Non-contiguous Pages / Logical Row Groups (mailing list
   <https://lists.apache.org/thread/jgq7wk3641ss27y851zdok1v2nskyvhd>) (design
   doc
   <https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA/edit?tab=t.0#heading=h.k4r8orckhbx0>)
   -

      Discussion Items
      -

         Approach #1: Preserve Row Group boundaries (draft example
         <https://github.com/danielcweeks/incubator-parquet-mr/pull/1>)
         -

         Approach #2: Purely logical (micro) row groups (draft example
         <https://github.com/apache/parquet-java/pull/3578>)
         -

         Impact on spec fields
         -

      Alkis proposes extending to purely logical row groups
      -

         Draft examples of implementation.
         -

         Row ranges divergent from row groups (pages can span row ranges)
         -

         => this becomes “micro-row groups”. Stats are per ranges instead
         of row group.
         -

         More of a conceptual change than a big implementation change:
         actually not much work implementation-wise.
         -

      TODO: Please check out the different tabs in the doc linked above.
      -

   [Steve] Parquet 1.18 (esp. variant)
   -

      What do we want in the 1.18 release?
      -

      Variant: opened PR on hardening reader.
      -

      Thanks Fokko.
      -

      Russell: how about we try the PR below for this release? Steve: yes
      -

   [Russ] Parquet Release Automation
   https://github.com/apache/parquet-java/pull/3548
   -

      Would make it a lot easier to release
      -

      Release manager doesn’t need to sign. The infra does it.
      -

   [Rok] Vector Type:
   -

      Design doc
      <https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab=t.0>
      discussion is winding down. How should we proceed?
      -

         Start working on draft implementations
         -

         Follow up on the dev list
         -

      C++ PoC <https://github.com/rok/arrow/pull/51> with VECTOR repetition
      type shows significant speedup vs LIST. Python wrapper on top of it shows
      comparable performance to NumPy space-wise.


   -

   [Divjot] New footer update
   -

      Modular + SoA encoding path (Parquet footer working doc
      <https://docs.google.com/document/d/1eiygLyg_jcEiPnF_F6XI01dmX2DDa6UjDJH9ZAJCKe8/edit?usp=sharing>)
      -

      Next meeting we will discuss offsetIndex modular + footer benchmark,
      same time next week
      -

      General consensus on using a modular footer
      -

      Thrift vs FlatBuffers: Thrift appears to be efficient with custom
      decoders
      -

         Question on jump table representation
         -

      Next steps: draft implementation and benchmarks
      -

   FYI int96 stats for parquet-java (PR
   <https://github.com/apache/parquet-java/pull/3590>)
   -

      Needs some alignment on detecting invalid stats from buggy writers
      (mailing list
      <https://lists.apache.org/thread/b95w2sb99d9v5lxscrfkg1tffj1k4p2w>)
      -

      Ryan: have an explicit SortOrder to indicate a writer is producing
      valid int96 stats rather than a “createdBy” allowlist
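
To make the two options above concrete, here is a minimal sketch of how a reader might decide whether to trust int96 min/max stats. Everything in it is an assumption for illustration: the writer string, the `INT96_SORT_ORDER` marker, and both function names are hypothetical and are not taken from the Parquet spec or from the PR.

```python
# Hypothetical values for illustration only; not taken from the Parquet spec.
TRUSTED_CREATED_BY_PREFIXES = ("example-writer version 2.",)
INT96_SORT_ORDER = "INT96_TIMESTAMP"  # stand-in for an explicit SortOrder marker


def trust_by_allowlist(created_by):
    """Allowlist approach: trust stats only from known-good writer strings."""
    return bool(created_by) and created_by.startswith(TRUSTED_CREATED_BY_PREFIXES)


def trust_by_sort_order(sort_order):
    """Explicit approach: the writer declares the order it used for min/max."""
    return sort_order == INT96_SORT_ORDER
```

Under the allowlist, a correct writer that readers do not yet know about is still distrusted until every reader updates its list; with an explicit marker, the file itself carries the signal.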
      -

   [Gunnar] Hardwood
   -

      1.0.0.CR1
      <https://www.morling.dev/blog/improved-column-reader-api-geospatial-support-hardwood-1-0-0-cr1-available/>
      released this week; 1.0 Final planned for next week
      -

      Would love for folks to take it for a spin and report back any issues


On Wed, Jun 3, 2026 at 8:44 AM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is today Wednesday Jun 3rd at 10am PT - 1pm ET -
> 7pm CET (in ~1h)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
