[jira] [Commented] (PARQUET-1950) Define core features / compliance level

ASF GitHub Bot (Jira) Wed, 31 May 2023 09:18:13 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728054#comment-17728054
 ]


ASF GitHub Bot commented on PARQUET-1950:
-----------------------------------------

raunaqmorarka commented on code in PR #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r1211980698


##########
CoreFeatures.md:
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review Comment:
   Trino also supports reading V2 data pages. However, when we attempted to use 
it in our parquet writer, we discovered other engines failed to read it 
https://github.com/trinodb/trino/issues/6377
   There was confusion about whether V2 is officially finalised 
https://github.com/trinodb/trino/issues/7953#issuecomment-872544273
   I believe support for V2 pages in other engines has improved since then.
   If it's simply a matter of adoption, then it would help to have some clarity 
about it in the spec (like for example acknowledging it as a core feature here) 
so that someone implementing a parquet writer can be assured that writing V2 
pages is not some experimental feature that can be deprecated from the format.





> Define core features / compliance level
> ---------------------------------------
>
>                 Key: PARQUET-1950
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1950
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-1950) Define core features / compliance level

Reply via email to