Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Wed, 22 May 2024 05:46:17 -0700


pitrou commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1609886236



##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
   9: optional binary footer_signing_key_metadata
 }
 
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+  /** All column chunks in this file (one per row group) **/
+  1: required list<ColumnChunkV3> columns
+
+  /** Sort order used for the Statistics min_value and max_value fields
+   **/
+  2: optional ColumnOrder column_order;
+
+  /** NEW from v1: Optional key/value metadata for this column at the file 
level
+   **/
+  3: optional list<KeyValue> key_value_metadata
+}
+
+struct FileMetaDataV3 {
+  /** Version of this file **/
+  1: required i32 version
+
+  /** Parquet schema for this file **/
+  2: required list<SchemaElementV3> schema;
+
+  /** Number of rows in this file **/
+  3: required i64 num_rows
+
+  /** Row groups in this file **/
+  4: required list<RowGroupV3> row_groups
+
+  /** Optional key/value metadata for this file. **/
+  5: optional list<KeyValue> key_value_metadata
+
+  /** String for application that wrote this file. **/
+  6: optional string created_by
+
+  /** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
+  7: required list<i64> file_column_metadata_offset;
+  /** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
+  8: required list<i32> file_column_metadata_length;

Review Comment:
   I was going with the same intuition as @etseidl 's, though I don't have any 
precise insight into typical Thrift deserializer performance characteristics.
   
   Mandating contiguous column metadata objects and using N+1 offsets is an 
intriguing idea. It could perhaps allow preloading many columns at once more 
easily.



##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
   9: optional binary footer_signing_key_metadata
 }
 
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+  /** All column chunks in this file (one per row group) **/
+  1: required list<ColumnChunkV3> columns
+
+  /** Sort order used for the Statistics min_value and max_value fields
+   **/
+  2: optional ColumnOrder column_order;
+
+  /** NEW from v1: Optional key/value metadata for this column at the file 
level
+   **/
+  3: optional list<KeyValue> key_value_metadata
+}
+
+struct FileMetaDataV3 {
+  /** Version of this file **/
+  1: required i32 version
+
+  /** Parquet schema for this file **/
+  2: required list<SchemaElementV3> schema;
+
+  /** Number of rows in this file **/
+  3: required i64 num_rows
+
+  /** Row groups in this file **/
+  4: required list<RowGroupV3> row_groups
+
+  /** Optional key/value metadata for this file. **/
+  5: optional list<KeyValue> key_value_metadata
+
+  /** String for application that wrote this file. **/
+  6: optional string created_by
+
+  /** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
+  7: required list<i64> file_column_metadata_offset;
+  /** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
+  8: required list<i32> file_column_metadata_length;

Review Comment:
   I was going with the same intuition as @etseidl 's, though I don't have any 
precise insight into typical Thrift deserializer performance characteristics.
   
   Mandating contiguous column metadata objects and using N+1 offsets is an 
intriguing idea. It could perhaps allow preloading many column metadata at once 
more easily.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

Reply via email to