[
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728054#comment-17728054
]
ASF GitHub Bot commented on PARQUET-1950:
-----------------------------------------
raunaqmorarka commented on code in PR #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r1211980698
##########
CoreFeatures.md:
##########
@@ -0,0 +1,181 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one
+ - or more contributor license agreements. See the NOTICE file
+ - distributed with this work for additional information
+ - regarding copyright ownership. The ASF licenses this file
+ - to you under the Apache License, Version 2.0 (the
+ - "License"); you may not use this file except in compliance
+ - with the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing,
+ - software distributed under the License is distributed on an
+ - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ - KIND, either express or implied. See the License for the
+ - specific language governing permissions and limitations
+ - under the License.
+ -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release defines a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release's core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases, which follow the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
Review Comment:
Trino also supports reading V2 data pages. However, when we attempted to use
them in our Parquet writer, we discovered that other engines failed to read them:
https://github.com/trinodb/trino/issues/6377
There was also confusion about whether V2 is officially finalised:
https://github.com/trinodb/trino/issues/7953#issuecomment-872544273
I believe support for V2 pages in other engines has improved since then.
If it's simply a matter of adoption, then it would help to have some clarity
about it in the spec (for example, by acknowledging it as a core feature here)
so that someone implementing a Parquet writer can be assured that writing V2
pages is not an experimental feature that could be deprecated from the format.
> Define core features / compliance level
> ---------------------------------------
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> The Parquet format is getting more and more features, while the different
> implementations cannot keep pace and are left behind, with some features
> implemented and some not. In many cases it is also not clear whether the
> related feature is mature enough to be used widely or is more of an
> experimental one.
> These are huge issues that make it hard to ensure interoperability between the
> different implementations.
> The following idea came up in a
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
> Create a new document in the parquet-format repository that lists the "core
> features". This document is versioned by the parquet-format releases. This
> way a certain version of "core features" defines a level of compatibility
> between the different implementations. This version number can be written to
> a new field (e.g. complianceLevel) in the footer. If an implementation writes
> a file with a version in the field it must implement all the related "core
> features" (read and write) and must not use any other features at write
> because it makes the data unreadable by another implementation if only the
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we
> cannot use encoding B because it would make the related data unreadable.
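
As a rough illustration of the encoding A / encoding B example in the issue description above, a writer-side check could look like the sketch below. Everything in it is hypothetical: the `CORE_ENCODINGS` registry, the `disallowed_encodings` helper, and the placeholder encoding names "A" and "B" are assumptions made for illustration and are not part of parquet-format or of any Parquet implementation.

```python
# Hypothetical registry: compliance level -> encodings listed as "core features".
# "A" and "B" are the placeholder names from the issue description,
# not real Parquet encodings.
CORE_ENCODINGS = {1: frozenset({"A"})}


def disallowed_encodings(compliance_level, used_encodings):
    """Return, sorted, the encodings a writer must not use at this level."""
    core = CORE_ENCODINGS.get(compliance_level)
    if core is None:
        raise ValueError(f"unknown compliance level: {compliance_level}")
    # Anything outside the core list could make the file unreadable for an
    # implementation that only supports this level.
    return sorted(set(used_encodings) - core)
```

Under these assumptions, a writer that declares `complianceLevel = 1` would call `disallowed_encodings(1, ["A", "B"])` before writing and refuse to proceed (or drop the declared level) if the result is non-empty.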