[REPORT] Parquet - July 2024

Julien Le Dem Wed, 10 Jul 2024 12:30:20 -0700

## Description:
A column-oriented data file format designed for efficient data storage and
retrieval. It provides high performance compression and encoding schemes to
handle complex data in bulk and is supported in many programming languages and
analytics tools.


## Project Status:
Current project status: Parquet is an ongoing, fairly mature project. Because
it is a file format that must remain backward compatible, new features are
added relatively slowly. Activity is increasing around changes to improve the
format under the "Parquet V3" label (see Project Activity below).
Issues for the board: none

## Membership Data:
Apache Parquet was founded 2015-04-21 (9 years ago)
There are currently 38 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 5:4.

Community changes, past quarter:
- Gang Wu was added to the PMC on 2024-05-10
- No new committers. Last addition was Gang Wu on 2023-02-28.
- Julien Le Dem is now the PMC chair. Thank you Xinli for your service!

## Project Activity:
- Discussions on adding Parquet extension support (Parquet extensions:
  https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit).
  The end goal is to allow fast iteration on new features and accelerate
  innovation.

- Adding support for geo data types in Parquet. This feature is also
  progressing in the wider open source data ecosystem (including in Iceberg,
  for example).

- There are discussions to clarify the process for adopting new features in
  parquet-format and for releasing Parquet Java:
  https://lists.apache.org/thread/nq7n6pbp222txrfo232ybgpvlvpmykbp

- "Parquet V3":
   parquet-format 2.10.0 was released on 2023-11-20.
   There are a few discussions under the "Parquet V3" label. I
   put this in quotes because the goal is not to make a major incompatible
   release but instead to add functionality or change the format in a
   backwards compatible way in a few areas:
  - Improve the footer metadata format to speed up access to wide schemas:
    Wide schemas are schemas with many columns (1000s, 10,000s or more).
    Currently, the footer is one thrift data structure. This means that when
    reading a few columns of a very wide file, one must scan all the columns'
    metadata to read the few interesting columns. When the metadata is large,
    this is significant overhead. Current discussion includes splitting the
    thrift metadata or using flatbuffers (like the Arrow project). In
    particular, this requires a mechanism to add a new footer in a way that
    doesn't break old readers during the transition period.
  - New encodings: In particular, encodings that better compress time series
    or strings. The consensus is to add a few encodings that will solve this
    well on average. A few research papers on this topic have been mentioned.
  - Cross validation: The ecosystem has grown quite a bit since the initial
    release of Parquet, so there are discussions to introduce a new cross
    compatibility testing framework to ensure that the various integrations in
    open source or proprietary projects are compatible and respect the same
    semantics. See https://github.com/apache/parquet-format/issues/441

- Parquet-MR has been renamed to Parquet-Java to better reflect what’s in
  the repository. Parquet-Java has made two releases: 1.14.0 in May 2024,
  and 1.14.1 in June 2024.

- Parquet C++ implementation location: A while back, the Parquet C++ code was
  moved to the Arrow repo to ease dependency management between the two code
  bases. The C++ language in particular makes cross-repo dependencies
  difficult. This has raised questions on whether the Parquet C++ code base
  should move back to its own repo to clarify governance. The current
  consensus (across the Parquet and Arrow PMCs) is to keep it as is, because
  moving it would be technically difficult and would make C++ development
  across the two repos painful.

- Issue migration to GitHub: as issue tracking was being migrated for the
  parquet-cpp codebase, moving other issues to GitHub added relatively little
  overhead. We migrated 2485 past and current issues from Parquet Jira to
  GitHub issue trackers. We strove to keep contents and metadata as close to
  the originals as possible to minimize disruption to contributors' work and
  to preserve the historical record of work. Comments, issue crosslinks, attachments,
  versions, priorities and labels were preserved wherever possible. Authorship
  is indicated with Jira and GitHub (where known) usernames. All issues for
  Apache Parquet are now tracked in GitHub issue trackers of parquet-java,
  parquet-format, parquet-testing, parquet-site and arrow (for parquet-cpp).

- An effort to document the client feature compatibility matrix across the
  ecosystem is currently under discussion:
  https://github.com/apache/parquet-site/pull/34
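
To make the wide-schema footer issue from the "Parquet V3" item above more
concrete, here is a minimal Python sketch. It is an illustration only, not the
Parquet format or any proposed design: JSON stands in for thrift, and the
column fields, function names and the name-to-offset index are all
hypothetical. It compares how many metadata bytes must be decoded to read two
columns out of 10,000 when the footer is a single structure versus when each
column's metadata can be located and decoded on its own.

```python
import json


def build_monolithic_footer(columns):
    """Serialize all column metadata as one blob (stand-in for one thrift struct)."""
    return json.dumps(columns).encode("utf-8")


def read_monolithic(footer, wanted):
    """Decode the whole footer, then keep only the wanted columns."""
    all_columns = json.loads(footer.decode("utf-8"))
    selected = {c["name"]: c for c in all_columns if c["name"] in wanted}
    return selected, len(footer)  # every byte of the footer was decoded


def build_split_footer(columns):
    """Serialize each column separately and record name -> (offset, length)."""
    body = bytearray()
    index = {}
    for col in columns:
        blob = json.dumps(col).encode("utf-8")
        index[col["name"]] = (len(body), len(blob))
        body.extend(blob)
    return bytes(body), index


def read_split(body, index, wanted):
    """Decode only the metadata entries of the wanted columns."""
    selected = {}
    decoded = 0
    for name in wanted:
        offset, length = index[name]
        selected[name] = json.loads(body[offset:offset + length].decode("utf-8"))
        decoded += length
    return selected, decoded


# Hypothetical wide schema: 10,000 columns with made-up metadata fields.
columns = [
    {"name": f"col_{i}", "type": "INT64", "data_offset": i * 4096}
    for i in range(10_000)
]
wanted = ["col_7", "col_4242"]

mono_sel, mono_bytes = read_monolithic(build_monolithic_footer(columns), set(wanted))
body, index = build_split_footer(columns)
split_sel, split_bytes = read_split(body, index, wanted)

print(f"monolithic footer: decoded {mono_bytes} bytes")
print(f"split footer:      decoded {split_bytes} bytes")
```

The sketch keeps the index in memory for simplicity. A real footer layout would
also have to store that index compactly and remain readable by older readers,
which is the hard part of the discussion summarized above.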

## Community Health:
There is a surge in email traffic linked to the "Parquet V3" discussion
summarized above (roughly +300% on the dev list). This is expected to continue
over the next few quarters as we make progress towards a V3.
