This is an automated email from the ASF dual-hosted git repository.
emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 5b564f3 PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)
5b564f3 is described below
commit 5b564f3c47679526cf72e54f207013f28f53acc4
Author: Ed Seidl <[email protected]>
AuthorDate: Wed Jul 3 12:58:57 2024 -0700
PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)
This field is not consistently set or read by implementations.
---
README.md | 26 +++++++++++++-------------
src/main/thrift/parquet.thrift | 19 ++++++++++++++-----
2 files changed, 27 insertions(+), 18 deletions(-)
diff --git a/README.md b/README.md
index 9567c63..d268b45 100644
--- a/README.md
+++ b/README.md
@@ -89,29 +89,29 @@ more pages.
This file and the [Thrift definition](src/main/thrift/parquet.thrift) should
be read together to understand the format.
4-byte magic number "PAR1"
- <Column 1 Chunk 1 + Column Metadata>
- <Column 2 Chunk 1 + Column Metadata>
+ <Column 1 Chunk 1>
+ <Column 2 Chunk 1>
...
- <Column N Chunk 1 + Column Metadata>
- <Column 1 Chunk 2 + Column Metadata>
- <Column 2 Chunk 2 + Column Metadata>
+ <Column N Chunk 1>
+ <Column 1 Chunk 2>
+ <Column 2 Chunk 2>
...
- <Column N Chunk 2 + Column Metadata>
+ <Column N Chunk 2>
...
- <Column 1 Chunk M + Column Metadata>
- <Column 2 Chunk M + Column Metadata>
+ <Column 1 Chunk M>
+ <Column 2 Chunk M>
...
- <Column N Chunk M + Column Metadata>
+ <Column N Chunk M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
+groups. The file metadata contains the locations of all the column chunk
start locations. More details on what is contained in the metadata can be found
in the Thrift definition.
-Metadata is written after the data to allow for single pass writing.
+File Metadata is written after the data to allow for single pass writing.
Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read
sequentially.
@@ -119,8 +119,8 @@ chunks they are interested in. The columns chunks should then be read sequentia

## Metadata
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata. All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata and page header metadata. All thrift structures
+are serialized using the TCompactProtocol.
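
The file layout described in the README hunk above (magic number, column chunks, file metadata, 4-byte little-endian metadata length, magic number) can be illustrated with a minimal Python sketch. It is not part of the commit. The function name `locate_file_metadata` and the synthetic buffer are assumptions made for illustration, and the sketch only locates the serialized file metadata; it does not decode the Thrift payload.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def locate_file_metadata(buf: bytes) -> Tuple[int, int]:
    """Return (start, length) of the serialized file metadata in buf.

    Hypothetical helper: it follows the layout in the README text only.
    """
    # Smallest possible file: leading magic + length field + trailing magic.
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic hold the metadata length,
    # stored as a little-endian unsigned integer.
    (length,) = struct.unpack("<I", buf[-8:-4])
    start = len(buf) - 8 - length
    if start < 4:
        raise ValueError("file metadata length exceeds file size")
    return start, length


# Synthetic buffer: the chunk and metadata bytes are placeholders,
# not real Parquet content.
fake_chunks = b"\x00" * 10
fake_metadata = b"\x01" * 5
buf = MAGIC + fake_chunks + fake_metadata + struct.pack("<I", 5) + MAGIC
start, length = locate_file_metadata(buf)
```

Because the length and magic number sit at the end of the file, a reader can find the file metadata with one read of the tail, which matches the README's note that file metadata is written after the data to allow single-pass writing.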

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 934b3ca..9e83529 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -867,12 +867,21 @@ struct ColumnChunk {
**/
1: optional string file_path
- /** Byte offset in file_path to the ColumnMetaData **/
- 2: required i64 file_offset
+ /** Deprecated: Byte offset in file_path to the ColumnMetaData
+ *
+ * Past use of this field has been inconsistent, with some implementations
+ * using it to point to the ColumnMetaData and some using it to point to
+ * the first page in the column chunk. In many cases, the ColumnMetaData at this
+ * location is wrong. This field is now deprecated and should not be used.
+ * Writers should set this field to 0 if no ColumnMetaData has been written outside
+ * the footer.
+ */
+ 2: required i64 file_offset = 0
- /** Column metadata for this chunk. This is the same content as what is at
- * file_path/file_offset. Having it here has it replicated in the file
- * metadata.
+ /** Column metadata for this chunk. Some writers may also replicate this at the
+ * location pointed to by file_path/file_offset.
+ * Note: while marked as optional, this field is in fact required by most major
+ * Parquet implementations. As such, writers MUST populate this field.
**/
3: optional ColumnMetaData meta_data
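
The practical effect of this change on readers and writers can be sketched as follows. This is not code from the commit: the dataclasses are simplified stand-ins for the Thrift structs, the `data_page_offset` and `dictionary_page_offset` fields of `ColumnMetaData` are not shown in this diff, and taking the smaller of the two offsets as the chunk start is an assumed convention rather than something the commit specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    # Simplified stand-in; the real struct has many more fields.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    # Deprecated field: writers set it to 0 when no ColumnMetaData
    # has been written outside the footer.
    file_offset: int = 0
    # Optional in the Thrift definition, but writers MUST populate it.
    meta_data: Optional[ColumnMetaData] = None
    file_path: Optional[str] = None


def chunk_start(chunk: ColumnChunk) -> int:
    """Hypothetical helper: find where a column chunk begins.

    It never consults chunk.file_offset, since past use of that field
    has been inconsistent across implementations.
    """
    if chunk.meta_data is None:
        raise ValueError("ColumnChunk.meta_data is missing")
    md = chunk.meta_data
    # Assumption: a dictionary page, when present, precedes the data pages.
    if md.dictionary_page_offset is not None:
        return min(md.dictionary_page_offset, md.data_page_offset)
    return md.data_page_offset


# A chunk whose file_offset is stale still resolves from meta_data alone.
legacy = ColumnChunk(file_offset=999,
                     meta_data=ColumnMetaData(data_page_offset=120,
                                              dictionary_page_offset=4))
```

Keeping `file_offset` out of the lookup means a file written by either kind of legacy implementation (pointing at the ColumnMetaData or at the first page) is read the same way.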