This is an automated email from the ASF dual-hosted git repository.
emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 5a5c894 PARQUET-2489: Guidance on feature releases (#258)
5a5c894 is described below
commit 5a5c8948e60770f8a8356a8f5e616d5ae1079d4b
Author: emkornfield <[email protected]>
AuthorDate: Thu Jul 25 00:53:21 2024 -0700
PARQUET-2489: Guidance on feature releases (#258)
Add guidance on adding new features and turning features on by default.
Co-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
---
CONTRIBUTING.md | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 172 insertions(+), 1 deletion(-)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 38a845e..d6049a8 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -17,7 +17,7 @@
- under the License.
-->
-Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes.
+Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. If you believe there should be a change or exception to these rules please bring it up for discussion on the developer mailing list ([email protected]).
### Key branches
- `master` has the latest stable changes
@@ -29,3 +29,174 @@ Recommendations and requirements for how to best contribute to Parquet. We striv
### License
By contributing your code, you agree to license your contribution under the
terms of the APLv2:
https://github.com/apache/parquet-format/blob/master/LICENSE
+
+### Additions/Changes to the Format
+
+Note: This section applies to actual functional changes to the specification.
+Fixing typos and grammar, or clarifying concepts without changing the
+semantics of the specification, can be done as long as a committer feels
+comfortable merging them. When in doubt, starting a discussion on the dev
+mailing list is encouraged.
+
+The general steps for adding features to the format are as follows:
+
+1. Design/scoping: The goal of this phase is to identify design goals of a
+ feature and provide some demonstration that the feature meets those goals.
+ This phase starts with a discussion of changes on the developer mailing list
+ ([email protected]). Depending on the scope and goals of the feature,
+ it can be useful to provide additional artifacts as part of a discussion.
+ The artifacts can include a design document, a draft pull request to make
+ the discussion concrete, and/or a prototype implementation to demonstrate
+ the viability of implementation. This step is complete when there is lazy
+ consensus. Part of the consensus is whether it is sufficient to provide two
+ working implementations as outlined in step 2, or if demonstration of the
+ feature with a downstream query engine is necessary to justify the feature
+ (e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset
+ library, the Apache DataFusion query engine, or any other open source
+ engine).
+
+2. Completeness: The goal of this phase is to ensure the feature is viable and
+ that its specification is unambiguous, by demonstrating compatibility
+ between implementations. Once a change has lazy consensus, two
+ implementations of the feature demonstrating interoperability must also be
+ provided. One implementation MUST be
+ [`parquet-java`](http://github.com/apache/parquet-java). It is preferred
+ that the second implementation be
+ [`parquet-cpp`](https://github.com/apache/arrow) or
+ [`parquet-rs`](https://github.com/apache/arrow-rs); however, at the
+ discretion of the PMC any open source Parquet implementation may be
+ acceptable. Implementations whose contributors actively participate in the
+ community (e.g. keep their feature matrix up-to-date on the Parquet website)
+ are more likely to be considered. If discussed as a requirement in step 1
+ above, demonstration of integration with a query engine is also required for
+ this step. The implementations must be made available publicly, and they
+ should be fit for inclusion (for example, they were submitted as a pull
+ request against the target repository and committers gave positive reviews).
+ Reports on the benefits from closed source implementations are welcome and
+ can help lend weight to a feature's desirability but are not sufficient for
+ acceptance of a new feature.
+
+Unless otherwise discussed, it is expected the implementations will be
+developed from their respective main branch (i.e. backporting is not
+required), to demonstrate that the feature is mergeable to its implementation.
+
+3. Ratification: After the first two steps are complete, a formal vote is held
+ on [email protected] to officially ratify the feature. After the vote
+ passes the format change is merged into the `parquet-format` repository and
+ it is expected the changes from step 2 will also be merged soon after
+ (implementations should not be merged until the addition has been merged to
+ `parquet-format`).
+
+#### General guidelines/preferences on additions.
+
+1. To the greatest extent possible changes should have an option for forward
+ compatibility (old readers can still read files). The [compatibility and
+ feature enablement](#compatibility-and-feature-enablement) section below
+ provides more details on expectations for changes that break compatibility.
+
+2. New encodings should be fully specified in this repository and not
+ rely on external dependencies for implementation (i.e. `parquet-format`
+ is the source of truth for the encoding). If an encoding does require an
+ external dependency, then the external dependency must have its
+ own specification separate from implementation.
+
+3. New compression mechanisms should have a pure Java implementation that can
+ be used as a dependency in `parquet-java`. Exceptions may be
+ discussed on the mailing list to see if an implementation that is not
+ pure Java is acceptable.
+
+### Releases
+
+The Parquet PMC aims to do releases of the format package only as needed when
+new features are introduced. If multiple new features are being proposed
+simultaneously, some features might be consolidated into the same release.
+Guidance is provided below on when implementations should enable features added
+to the specification. Due to confusion in the past over Parquet versioning, it
+is not expected that there will be a 3.x release of the specification in the
+foreseeable future.
+
+### Compatibility and Feature Enablement
+
+For the purposes of this discussion we classify features into the following
+buckets:
+
+1. Backward compatible. A file written under an older version of the format
+ should be readable under a newer version of the format.
+
+2. Forward compatible. A file written under a newer version of the format with
+ the feature enabled can be read under an older version of the format, but
+ some metadata might be missing or performance might be suboptimal. Simply
+ phrased, forward compatible means all data can be read back in an older
+ version of the format. New logical types are considered forward
+ compatible despite the loss of semantic meaning.
+
+3. Forward incompatible. A file written under a newer version of the format
+ with the feature enabled cannot be read under an older version of the format
+ (e.g. adding and using a new compression algorithm). It is expected any
+ feature in this category will provide a signal to older readers, so they can
+ unambiguously determine that they cannot properly read the file (e.g. via
+ adding a new value to an existing enum).
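
As an illustration of the "signal to older readers" idea in the third bucket, the following is a minimal sketch of how an older reader might reject a file that uses an enum value it does not recognize. The `Codec` enum, `UnsupportedFeatureError`, `resolve_codec`, and the id `99` are hypothetical names and values chosen for this example; they are not taken from `parquet-format` or any Parquet implementation.

```python
from enum import IntEnum


class Codec(IntEnum):
    """Compression codecs known to this hypothetical *older* reader."""

    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2


class UnsupportedFeatureError(Exception):
    """Raised when file metadata uses a value this reader does not know."""


def resolve_codec(raw_value: int) -> Codec:
    """Map a raw enum value from file metadata to a known codec.

    An unknown value is treated as an unambiguous signal that the file uses
    a forward incompatible feature, so the reader fails loudly instead of
    attempting to decode the data.
    """
    try:
        return Codec(raw_value)
    except ValueError:
        raise UnsupportedFeatureError(
            f"unknown compression codec id {raw_value}; "
            "the file may have been written with a newer format feature"
        ) from None
```

With this shape, a file written with a codec id the reader was never built with (here the made-up id `99`) produces a clear error rather than corrupted output.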
+
+New features are intended to be widely beneficial to users of Parquet, and
+therefore it is hoped third-party implementations will adopt them quickly after
+they are introduced. It is assumed that writing new parts of the format, and
+especially forward incompatible features, will be configured with a feature
+flag defaulted to "off", and that at some future point the feature will be
+turned on by default (reading of the new feature will typically be enabled
+without configuration or defaulted to on). Some amount of lead time is
+desirable to ensure a critical mass of Parquet implementations support a
+feature, to avoid compatibility issues across the ecosystem. Therefore, the
+Parquet PMC gives the following recommendations for managing features:
+
+1. Backward compatibility is the concern of implementations, but given the
+ ubiquity of Parquet and the length of time it has been used, libraries
+ should support reading older versions of the format to the greatest extent
+ possible.
+
+2. Forward compatible features/changes may be enabled and used by default in
+ implementations once the parquet-format version containing those changes has
+ been formally released. For features that may pose a significant performance
+ regression to older format readers, libraries should consider delaying
+ default enablement until 1 year after the release of the parquet-java
+ implementation that contains the feature implementation.
+
+3. Forward incompatible features/changes should not be turned on by default
+ until 2 years after the parquet-java implementation containing the feature
+ is released. It is recommended that changing the default value for a forward
+ incompatible feature flag be clearly advertised to consumers (e.g. via
+ a major version release if using Semantic Versioning, or highlighted in
+ release notes).
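
The timelines in the three recommendations above can be sketched as a small date calculation. This is only an illustration of the arithmetic: the `Compat` categories, the function names, and the release dates used with them are assumptions made for this example, and real target dates are expected to come from `parquet-format` itself.

```python
from datetime import date
from enum import Enum


class Compat(Enum):
    """Hypothetical labels for the categories discussed above."""

    FORWARD_COMPATIBLE = "forward_compatible"
    FORWARD_COMPATIBLE_PERF_RISK = "forward_compatible_perf_risk"
    FORWARD_INCOMPATIBLE = "forward_incompatible"


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only reachable for Feb 29 when the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def earliest_default_on(compat: Compat, format_release: date, java_release: date) -> date:
    """Earliest recommended date for enabling a feature by default.

    format_release: release date of the parquet-format version with the change.
    java_release:   release date of the parquet-java version implementing it.
    """
    if compat is Compat.FORWARD_COMPATIBLE:
        return format_release
    if compat is Compat.FORWARD_COMPATIBLE_PERF_RISK:
        return add_years(java_release, 1)
    return add_years(java_release, 2)
```

For example, with a made-up `parquet-java` release date of `date(2025, 3, 1)`, a forward incompatible feature would have an earliest recommended default-on date of `date(2027, 3, 1)`.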
+
+For forward incompatible changes, and for forward compatible changes which
+have a high chance of performance regression for older readers, implementations
+should clearly document the compatibility issues. Additionally, while it is up
+to maintainers of individual open-source implementations to make the best
+decision to serve their ecosystem, they are encouraged to start enabling
+features by default along the same timelines as `parquet-java`. Parquet-java
+will wait to enable features by default until the most conservative timelines
+outlined above have been exceeded. This timeline is an attempt to balance
+ensuring new features make their way into the ecosystem and avoiding
+breaking compatibility for readers that are slower to adopt new standards. We
+encourage earlier adoption of new features when an organization using Parquet
+can guarantee that all readers of the Parquet files they produce can read a new
+feature.
+
+After turning a feature on by default, implementations are encouraged to keep
+a configuration to turn off the feature. A recommendation for full deprecation
+will be made in a future iteration of this document.
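
One way an implementation might keep such a configuration is sketched below: the library default can change between releases while an explicit user setting always wins. `WriterOptions`, `LIBRARY_DEFAULTS`, and the feature name `new_encoding` are hypothetical and do not correspond to any real Parquet writer API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical library-wide defaults. Flipping this value in a later release
# is what "turning the feature on by default" would look like in this sketch.
LIBRARY_DEFAULTS = {"new_encoding": True}


@dataclass
class WriterOptions:
    # None means "follow the library default"; True/False is an explicit choice.
    new_encoding: Optional[bool] = None

    def use_new_encoding(self) -> bool:
        """Resolve the effective setting, preferring the explicit user choice."""
        if self.new_encoding is not None:
            return self.new_encoding
        return LIBRARY_DEFAULTS["new_encoding"]
```

Writers that must stay readable by older consumers can then pass `WriterOptions(new_encoding=False)` even after the default has changed.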
+
+For features released prior to October 2024, target dates for each of these
+categories will be updated as part of the `parquet-java 2.0` release process
+based on a collected feature compatibility matrix.
+
+For each release of `parquet-java` or `parquet-format` that influences this
+guidance, it is expected exact dates will be added to parquet-format to provide
+clarity to implementors (e.g. when `parquet-java` 2.X.X is released, any new
+format features it uses will be updated with concrete dates). As part of
+`parquet-format` releases the compatibility matrix will be updated to contain
+the release date in the format. Implementations are also encouraged to provide
+implementation date/release version information when updating the feature
+matrix.
+
+End users of software are generally encouraged to consult the feature matrix
+and vendor documentation before enabling features that are not yet widely
+adopted.