[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728041#comment-17728041 ]
ASF GitHub Bot commented on PARQUET-1950: ----------------------------------------- raunaqmorarka commented on code in PR #164: URL: https://github.com/apache/parquet-format/pull/164#discussion_r1211954870 ########## CoreFeatures.md: ########## @@ -0,0 +1,187 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one + - or more contributor license agreements. See the NOTICE file + - distributed with this work for additional information + - regarding copyright ownership. The ASF licenses this file + - to you under the Apache License, Version 2.0 (the + - "License"); you may not use this file except in compliance + - with the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, + - software distributed under the License is distributed on an + - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + - KIND, either express or implied. See the License for the + - specific language governing permissions and limitations + - under the License. + --> + +# Parquet Core Features + +This document lists the core features for each parquet-format release. This +list is a subset of the features which parquet-format makes available. + +## Purpose + +The list of core features for a certain release makes a compliance level for +implementations. If a writer implementation claims that it is at a certain +compliance level then it must use only features from the *core feature list* of +that parquet-format release. If a reader implementation claims the same if must +implement all of the listed features. This way it is easier to ensure +compatibility between the different Parquet implementations. + +We cannot and don't want to stop our clients to use any features that are not +on this list but it shall be highlighted that using these features might make +the written Parquet files unreadable by other implementations. We can say that +the features available in a parquet-format release (and one of the +implementations of it) and not on the *core feature list* are experimental. + +## Versioning + +This document is versioned by the parquet-format releases which follows the +scheme of semantic versioning. It means that no feature will be deleted from +this document under the same major version. (We might deprecate some, though.) +Because of the semantic versioning if one implementation supports the core +features of the parquet-format release `a.c.x` it must be able to read any +Parquet files written by implementations supporting the release `a.b.y` where +`c >= b`. + +If a Parquet file is written according to a released version of this document +it might be a good idea to write this version into the field `compliance_level` +in the Thrift object `FileMetaData`. + +## Adding new features + +The idea is to only include features which are specified correctly and proven +to be useful for everyone. Because of that we require to have at least two +different implementations that are released and widely tested. We also require +to implement interoperability tests for that feature to prove one +implementation can read the data written by the other one and vice versa. + +## Core feature list + +This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift) +where all the data structures we might use in a Parquet file are defined. + +### File structure + +All of the required fields in the structure (and sub-structures) of +`FileMetaData` must be set according to the specification. +The following page types are supported: +* Data page V1 (see `DataPageHeader`) +* Dictionary page (see `DictionaryPageHeader`) + +**TODO**: list optional fields that must be filled properly. + +#### Column chunk file reference + +The optional field `file_path` in the `ColumnChunk` object of the Parquet footer +(aka Parquet Thrift file) makes it available to reference an external file. This +option was used for different features like _summary files_ or +_external column chunks_. These features were never specified correctly and +they did not spread across the different implementations. Because of that we do +not include these features in this document and therefore the field `file_path` +is not supported. + +### Types + +#### Primitive types + +The following [primitive types](README.md#types) are supported +* `BOOLEAN` +* `INT32` +* `INT64` +* `FLOAT` +* `DOUBLE` +* `BYTE\_ARRAY` +* `FIXED\_LEN\_BYTE\_ARRAY` + +NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed +here. + +#### Logical types + +The [logical type](LogicalTypes.md)s are practically annotations helping to +understand the related primitive type (or structure). Originally we have had +the `ConvertedType` enum in the Thrift file representing all the possible +logical types. After a while we realized it is hard to extend and so introduced +the `LogicalType` union. For backward compatibility reasons we allow to use the +old `ConvertedType` values according to the specified rules but we expect that +the logical types in the file schema are defined with `LogicalType` objects. + +The following LogicalTypes are supported: +* `STRING` +* `MAP` +* `LIST` +* `ENUM` +* `DECIMAL` (for which primitives?) +* `DATE` +* `TIME`: **(Which unit, utc?)** +* `TIMESTAMP`: **(Which unit, utc?)** +* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)** +* `UNKNOWN` **(?)** +* `JSON` **(?)** +* `BSON` **(?)** +* `UUID` **(?)** + +NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes. +This is becasue `INTERVAL` is deprecated so we do not include it in this list. + +### Encodings + +The following encodings are supported: +* [PLAIN](Encodings.md#plain-plain--0) + parquet-mr: Basically all value types are written in this encoding in case of + V1 pages +* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) + **(?)** + parquet-mr: As per the spec this encoding is deprecated while we still use it + for V1 page dictionaries. +* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) + parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN + values in case of V2 pages +* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5) + **(?)** + parquet-mr: Used for V2 pages to encode INT32 and INT64 values. +* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6) + **(?)** + parquet-mr: Not used directly Review Comment: That might be true from the point of view of compression, however the decoder logic is more complicated with DELTA_BYTE_ARRAY and decoding speed could be slower. If DELTA_BYTE_ARRAY is always the better choice, the spec should instead recommend that DELTA_BYTE_ARRAY should always be preferred over PLAIN for byte array columns and explicitly discourage using DELTA_LENGTH_BYTE_ARRAY directly as an encoding for byte arrays. > Define core features / compliance level > --------------------------------------- > > Key: PARQUET-1950 > URL: https://issues.apache.org/jira/browse/PARQUET-1950 > Project: Parquet > Issue Type: New Feature > Components: parquet-format > Reporter: Gabor Szadovszky > Assignee: Gabor Szadovszky > Priority: Major > > Parquet format is getting more and more features while the different > implementations cannot keep the pace and left behind with some features > implemented and some are not. In many cases it is also not clear if the > related feature is mature enough to be used widely or more an experimental > one. > These are huge issues that makes hard ensure interoperability between the > different implementations. > The following idea came up in a > [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E]. > Create a now document in the parquet-format repository that lists the "core > features". This document is versioned by the parquet-format releases. This > way a certain version of "core features" defines a level of compatibility > between the different implementations. This version number can be written to > a new field (e.g. complianceLevel) in the footer. If an implementation writes > a file with a version in the field it must implement all the related "core > features" (read and write) and must not use any other features at write > because it makes the data unreadable by another implementation if only the > same level of "core features" are implemented. > For example if we have encoding A listed in the version 1 "core features" but > encoding B is not then at "complianceLevel = 1" we can use encoding A but we > cannot use encoding B because it would make the related data unreadable. -- This message was sent by Atlassian Jira (v8.20.10#820010)