Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Tue, 21 May 2024 02:00:16 -0700


alkis commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1607911563



##########
src/main/thrift/parquet.thrift:
##########
@@ -835,6 +864,65 @@ struct ColumnMetaData {
   16: optional SizeStatistics size_statistics;
 }
 
+struct ColumnChunkMetaDataV3 {

Review Comment:
   Similarly to SchemaElement, we can reuse the existing struct and deprecate 
the useless fields.



##########
README.md:
##########
@@ -107,12 +113,97 @@ start locations.  More details on what is contained in 
the metadata can be found
 in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
+This is especially useful when writing to backends such as S3.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read 
sequentially.
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>
+...
+<File-level Column N Metadata v3>
+
+File Metadata v3
+4-byte length in bytes of File Metadata v3 (little endian)
+4-byte magic number "PAR3"
+
+File Metadata
+4-byte length in bytes of File Metadata (little endian)
+4-byte magic number "PAR1"
+```
+
+Unlike the legacy File Metadata, the File Metadata v3 is designed to be 
light-weight
+to decode, regardless of the number of columns in the file. Individual column
+metadata can be opportunistically decoded depending on actual needs.
+
+This file structure is backwards-compatible. Parquet 1 readers will read and
+decode the legacy File Metadata in the file footer, while Parquet 3 readers
+will notice the "PAR3" magic number just before the File Metadata and will
+instead read and decode the File Metadata v3.

Review Comment:
   Efficient I/O can be challenging with this proposal. A reader needs to read 
the last 8 bytes of the file, then read 8 bytes before the legacy footer, 
figure out a v3 footer exists and then read that.
   
   It would be better if the v3 metadata are at the end of the file, right 
before the 4-byte len + PAR1.



##########
src/main/thrift/parquet.thrift:
##########
@@ -467,6 +467,35 @@ struct SchemaElement {
   10: optional LogicalType logicalType
 }
 
+struct SchemaElementV3 {
+  /** Data type for this field. */
+  1: optional Type type;
+
+  /** If type is FIXED_LEN_BYTE_ARRAY, this is the byte length of the values.
+   *
+   * CHANGED from v1: this must be omitted for other column types.
+   */
+  2: optional i32 type_length;
+
+  /** repetition of the field. */
+  3: optional FieldRepetitionType repetition_type;
+
+  /** Name of the field in the schema */
+  4: required string name;
+
+  /** Nested fields. */
+  5: optional i32 num_children;
+
+  /** CHANGED from v1: from i32 to i64

Review Comment:
   Same. I don't think this change is needed.



##########
src/main/thrift/parquet.thrift:
##########
@@ -835,6 +864,65 @@ struct ColumnMetaData {
   16: optional SizeStatistics size_statistics;
 }
 
+struct ColumnChunkMetaDataV3 {
+  /** REMOVED from v1: type (redundant with SchemaElementV3) */
+  /** REMOVED from v1: encodings (unnecessary and non-trivial to get right) */
+  /** REMOVED from v1: path_in_schema (unnecessary and wasteful) */
+  /** REMOVED from v1: index_page_offset (unused in practice?) */
+
+  /** Compression codec **/
+  1: required CompressionCodec codec

Review Comment:
   This could go in the `SchemaElement` since it should not in principle vary 
between row groups.



##########
src/main/thrift/parquet.thrift:
##########
@@ -467,6 +467,35 @@ struct SchemaElement {
   10: optional LogicalType logicalType
 }
 
+struct SchemaElementV3 {

Review Comment:
   +1 these changes are not necessary, we should mark the relevant bits in 
`SchemaElement` deprecated instead.



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   +1 on removing this if we are doing breaking changes.



##########
README.md:
##########
@@ -107,12 +113,97 @@ start locations.  More details on what is contained in 
the metadata can be found
 in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
+This is especially useful when writing to backends such as S3.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read 
sequentially.
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>
+...
+<File-level Column N Metadata v3>
+
+File Metadata v3
+4-byte length in bytes of File Metadata v3 (little endian)
+4-byte magic number "PAR3"
+
+File Metadata
+4-byte length in bytes of File Metadata (little endian)
+4-byte magic number "PAR1"
+```
+
+Unlike the legacy File Metadata, the File Metadata v3 is designed to be 
light-weight
+to decode, regardless of the number of columns in the file. Individual column
+metadata can be opportunistically decoded depending on actual needs.
+
+This file structure is backwards-compatible. Parquet 1 readers will read and
+decode the legacy File Metadata in the file footer, while Parquet 3 readers
+will notice the "PAR3" magic number just before the File Metadata and will
+instead read and decode the File Metadata v3.
+
+### Parquet 3 without legacy metadata
+
+It is possible, though not recommended for the time being, to produce Parquet 3
+files without the legacy metadata blocks:
+```
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>
+...
+<File-level Column N Metadata v3>
+
+File Metadata v3
+4-byte length in bytes of File Metadata v3 (little endian)
+4-byte magic number "PAR3"

Review Comment:
   How do encrypted footers work in this case?



##########
src/main/thrift/parquet.thrift:
##########
@@ -835,6 +864,65 @@ struct ColumnMetaData {
   16: optional SizeStatistics size_statistics;
 }
 
+struct ColumnChunkMetaDataV3 {

Review Comment:
   Also don't we need a index into the list of `SchemaElement` to reference it?



##########
src/main/thrift/parquet.thrift:
##########


Review Comment:
   The biggest culprit in parsing metadata is `Statistics` because every one of 
the values is a binary variable length thing. We could improve `Statistics` 
trivially and in place if we add a few fixed size fields:
   ```thrift
   struct Statistics {
      /**
       * DEPRECATED
       */
      1: optional binary max;
      2: optional binary min;
      /** count of null value in the column */
      3: optional i64 null_count;
      /** count of distinct values occurring */
      4: optional i64 distinct_count;
      /**
       * Only one pair of min/max will be populated. For fixed sized types, one 
of the minN/maxN variants will be used. Otherwise min_value/max_value is used.
       */
      5: optional binary max_value;
      6: optional binary min_value;
   
      7: optional byte max1;
      8: optional byte min1;
      9: optional i32 max4;
      10: optional i32 min4;
      11: optional i64 max8;
      12: optional i64 min8;
   }
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
   9: optional binary footer_signing_key_metadata
 }
 
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+  /** All column chunks in this file (one per row group) **/
+  1: required list<ColumnChunkV3> columns
+
+  /** Sort order used for the Statistics min_value and max_value fields
+   **/
+  2: optional ColumnOrder column_order;
+
+  /** NEW from v1: Optional key/value metadata for this column at the file 
level
+   **/
+  3: optional list<KeyValue> key_value_metadata
+}
+
+struct FileMetaDataV3 {
+  /** Version of this file **/
+  1: required i32 version
+
+  /** Parquet schema for this file **/
+  2: required list<SchemaElementV3> schema;
+
+  /** Number of rows in this file **/
+  3: required i64 num_rows
+
+  /** Row groups in this file **/
+  4: required list<RowGroupV3> row_groups

Review Comment:
   Perhaps that's true but is this something an engine would do other than when 
dealing with a sampling read or limit query?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

Reply via email to