(parquet-format) branch master updated: PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)

emkornfield Wed, 03 Jul 2024 13:01:03 -0700

This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new 5b564f3  PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)
5b564f3 is described below

commit 5b564f3c47679526cf72e54f207013f28f53acc4
Author: Ed Seidl <[email protected]>
AuthorDate: Wed Jul 3 12:58:57 2024 -0700

    PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)
    
    This field is not consistently set or read by implementations.
---
 README.md                      | 26 +++++++++++++-------------
 src/main/thrift/parquet.thrift | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index 9567c63..d268b45 100644
--- a/README.md
+++ b/README.md
@@ -89,29 +89,29 @@ more pages.
 This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.
 
     4-byte magic number "PAR1"
-    <Column 1 Chunk 1 + Column Metadata>
-    <Column 2 Chunk 1 + Column Metadata>
+    <Column 1 Chunk 1>
+    <Column 2 Chunk 1>
     ...
-    <Column N Chunk 1 + Column Metadata>
-    <Column 1 Chunk 2 + Column Metadata>
-    <Column 2 Chunk 2 + Column Metadata>
+    <Column N Chunk 1>
+    <Column 1 Chunk 2>
+    <Column 2 Chunk 2>
     ...
-    <Column N Chunk 2 + Column Metadata>
+    <Column N Chunk 2>
     ...
-    <Column 1 Chunk M + Column Metadata>
-    <Column 2 Chunk M + Column Metadata>
+    <Column 1 Chunk M>
+    <Column 2 Chunk M>
     ...
-    <Column N Chunk M + Column Metadata>
+    <Column N Chunk M>
     File Metadata
     4-byte length in bytes of file metadata (little endian)
     4-byte magic number "PAR1"
 
 In the above example, there are N columns in this table, split into M row
-groups.  The file metadata contains the locations of all the column metadata
+groups.  The file metadata contains the locations of all the column chunk
 start locations.  More details on what is contained in the metadata can be found
 in the Thrift definition.
 
-Metadata is written after the data to allow for single pass writing.
+File Metadata is written after the data to allow for single pass writing.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read sequentially.
@@ -119,8 +119,8 @@ chunks they are interested in.  The columns chunks should then be read sequentia
  ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
 ## Metadata
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata.  All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata and page header metadata.  All thrift structures
+are serialized using the TCompactProtocol.
 
  ![Metadata diagram](https://github.com/apache/parquet-format/raw/master/doc/images/FileFormat.gif)
 
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 934b3ca..9e83529 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -867,12 +867,21 @@ struct ColumnChunk {
     **/
   1: optional string file_path
 
-  /** Byte offset in file_path to the ColumnMetaData **/
-  2: required i64 file_offset
+  /** Deprecated: Byte offset in file_path to the ColumnMetaData
+   *
+   * Past use of this field has been inconsistent, with some implementations
+   * using it to point to the ColumnMetaData and some using it to point to
+   * the first page in the column chunk. In many cases, the ColumnMetaData at this
+   * location is wrong. This field is now deprecated and should not be used.
+   * Writers should set this field to 0 if no ColumnMetaData has been written outside
+   * the footer.
+   */
+  2: required i64 file_offset = 0
 
-  /** Column metadata for this chunk. This is the same content as what is at
-   * file_path/file_offset.  Having it here has it replicated in the file
-   * metadata.
+  /** Column metadata for this chunk. Some writers may also replicate this at the
+   * location pointed to by file_path/file_offset.
+   * Note: while marked as optional, this field is in fact required by most major
+   * Parquet implementations. As such, writers MUST populate this field.
    **/
   3: optional ColumnMetaData meta_data
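
Illustrative note (not part of the commit): the updated comments suggest that a reader should ignore file_offset and take chunk locations from meta_data in the footer. The Python sketch below shows one way that could look. The dataclasses are hypothetical stand-ins for the Thrift structs, reduced to the fields needed here. data_page_offset and dictionary_page_offset are fields of ColumnMetaData in parquet.thrift but do not appear in this diff, and the rule that a dictionary page precedes the data pages is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    # Hypothetical stand-in: only the two page offsets used below.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for the Thrift struct shown in the diff.
    file_path: Optional[str] = None
    file_offset: int = 0  # deprecated; writers set 0 when nothing is written outside the footer
    meta_data: Optional[ColumnMetaData] = None


def chunk_start(chunk: ColumnChunk) -> int:
    """Return the offset of the chunk's first page without consulting file_offset."""
    md = chunk.meta_data
    if md is None:
        # Marked optional in Thrift, but the commit says writers MUST populate it.
        raise ValueError("ColumnChunk.meta_data is missing")
    dict_offset = md.dictionary_page_offset
    # Assumption: a dictionary page, when present, comes before the data pages.
    if dict_offset is not None and 0 < dict_offset < md.data_page_offset:
        return dict_offset
    return md.data_page_offset
```

With these stand-ins, a chunk whose file_offset holds a stale value still resolves from meta_data alone, and a chunk without meta_data is rejected rather than falling back to file_offset.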
