## Description:
Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides high-performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
## Project Status:
Current project status: Parquet is an ongoing, fairly mature project. As a file format, it gains new features relatively slowly because backward compatibility is required. Activity is increasing around changes to improve the format under the "Parquet V3" label (see Project Activity below).
Issues for the board: none

## Membership Data:
Apache Parquet was founded 2015-04-21 (9 years ago)
There are currently 38 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 5:4.

Community changes, past quarter:
- Gang Wu was added to the PMC on 2024-05-10
- No new committers. Last addition was Gang Wu on 2023-02-28.
- Julien Le Dem is now the PMC chair. Thank you Xinli for your service!

## Project Activity:
- Discussions on adding Parquet extension support (Parquet extensions: https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit). The end goal is to allow fast iteration on new features and accelerate innovation.
- Adding support for geo data types in Parquet. This feature is progressing across the wider open source data ecosystem (including in Iceberg, for example).
- There are discussions to clarify the process for adopting new features in parquet-format and for releasing Parquet Java: https://lists.apache.org/thread/nq7n6pbp222txrfo232ybgpvlvpmykbp
- "Parquet V3": parquet-format 2.10.0 was released on 2023-11-20. There are a few discussions under the "Parquet V3" label. I put this in quotes as the goal is not to make a major incompatible release but instead to add functionality or change the format in a backwards compatible way in a few areas:
  - Improve the footer metadata format to improve access to wide schemas: Wide schemas are schemas with many columns (1000s, 10,000s or more). Currently, the footer is one Thrift data structure. This means that when reading a few columns of a very wide file, one must scan all the columns' metadata to read the few interesting columns. When the metadata is large, this is significant overhead. Current discussion includes splitting the Thrift metadata or using FlatBuffers (like the Arrow project). In particular, this requires a mechanism to add a new footer in a way that doesn't break old readers during the transition period.
  - New encodings: In particular, encodings that better compress time series or strings. The consensus is to add a few encodings that solve this well on average. A few research papers on this topic have been mentioned.
  - Cross validation: As the ecosystem has grown quite a bit since the initial release of Parquet, there are discussions to introduce a new cross-compatibility testing framework to ensure that the various integrations in open source or proprietary projects are compatible and respect the same semantics. See https://github.com/apache/parquet-format/issues/441
- Parquet-MR has been renamed to Parquet-Java to better reflect what’s in the repository. Parquet-Java has made two releases: 1.14.0 in May 2024 and 1.14.1 in June 2024.
- Parquet C++ implementation location: A while back, Parquet C++ was moved to the Arrow repo to ease dependency management between the two code bases. The C++ language in particular makes cross-repo dependencies difficult. This has raised questions on whether the Parquet C++ code base should move back to its own repo to clarify governance. The current consensus (across the Parquet and Arrow PMCs) is to keep it as is, because it is technically difficult to move it without making C++ development across the two repos painful.
- Issue migration to GitHub: As issue tracking was being migrated for the parquet-cpp codebase, moving other issues to GitHub added relatively little overhead. We migrated 2485 past and current issues from Parquet Jira to GitHub issue trackers. We strove to keep contents and metadata as close to the originals as possible to minimize disruption to the work of contributors and to keep the historical record of work. Comments, issue crosslinks, attachments, versions, priorities and labels were preserved wherever possible. Authorship is indicated with Jira and GitHub (where known) usernames. All issues for Apache Parquet are now tracked in the GitHub issue trackers of parquet-java, parquet-format, parquet-testing, parquet-site and arrow (for parquet-cpp).
- An effort to document the client feature compatibility matrix across the ecosystem is currently under discussion: https://github.com/apache/parquet-site/pull/34

## Community Health:
There is a surge in email traffic linked to the "Parquet V3" discussion summarized above (roughly +300% on the dev list). This should be sustained over the next few quarters as we make progress towards a V3.
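
To illustrate the wide-schema footer concern noted under Project Activity, here is a minimal Python sketch contrasting a footer that must be walked record by record with one that carries an offset table. This is a toy model only, not the Parquet footer format: the length-prefixed layout, the offset table, and all function names are assumptions made for illustration, and it uses neither Thrift nor FlatBuffers.

```python
import struct


def encode_sequential(entries):
    """Toy 'monolithic' footer: length-prefixed records back to back, no index."""
    return b"".join(struct.pack("<I", len(e)) + e for e in entries)


def read_sequential(blob, wanted):
    """Walk every record in order; return (selected entries, records visited)."""
    out, pos, idx, visited = {}, 0, 0, 0
    while pos < len(blob):
        (size,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        visited += 1
        if idx in wanted:
            out[idx] = blob[pos:pos + size]
        pos += size
        idx += 1
    return out, visited


def encode_indexed(entries):
    """Toy 'indexed' footer: record count, offset table, then the records."""
    header_size = 4 + 4 * (len(entries) + 1)
    offsets, pos = [], header_size
    for e in entries:
        offsets.append(pos)
        pos += len(e)
    offsets.append(pos)  # end offset of the last record
    header = struct.pack("<I", len(entries))
    table = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + table + b"".join(entries)


def read_indexed(blob, wanted):
    """Jump straight to the requested records via the offset table."""
    out = {}
    for idx in wanted:
        start, end = struct.unpack_from("<2I", blob, 4 + 4 * idx)
        out[idx] = blob[start:end]
    return out, len(out)


# Hypothetical wide schema: 10,000 columns, of which only two are read.
columns = [f"col{i}:stats".encode() for i in range(10_000)]
wanted = {3, 9_999}

seq_result, seq_visited = read_sequential(encode_sequential(columns), wanted)
idx_result, idx_visited = read_indexed(encode_indexed(columns), wanted)

print("sequential records visited:", seq_visited)
print("indexed records visited:", idx_visited)
```

Both readers return the same two entries, but the sequential reader visits all 10,000 records while the indexed reader touches only the two requested. Real footers differ in many ways (nested structures, row groups, statistics), so this shows only the shape of the overhead being discussed.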
