wgtmac commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1609271691
##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+ /** All column chunks in this file (one per row group) **/
+ 1: required list<ColumnChunkV3> columns
+
+ /** Sort order used for the Statistics min_value and max_value fields
+ **/
+ 2: optional ColumnOrder column_order;
+
+ /** NEW from v1: Optional key/value metadata for this column at the file
level
+ **/
+ 3: optional list<KeyValue> key_value_metadata
+}
+
+struct FileMetaDataV3 {
+ /** Version of this file **/
+ 1: required i32 version
+
+ /** Parquet schema for this file **/
+ 2: required list<SchemaElementV3> schema;
+
+ /** Number of rows in this file **/
+ 3: required i64 num_rows
+
+ /** Row groups in this file **/
+ 4: required list<RowGroupV3> row_groups
+
+ /** Optional key/value metadata for this file. **/
+ 5: optional list<KeyValue> key_value_metadata
+
+ /** String for application that wrote this file. **/
+ 6: optional string created_by
+
+ /** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
+ 7: required list<i64> file_column_metadata_offset;
+ /** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
+ 8: required list<i32> file_column_metadata_length;
Review Comment:
Is it too late to add something like below for all offset/length pair?
```
struct Reference {
1: i64 offset
2: i32 length
}
```
##########
src/main/thrift/parquet.thrift:
##########
@@ -914,6 +1040,32 @@ struct RowGroup {
7: optional i16 ordinal
}
+struct RowGroupV3 {
+ /** REMOVED from v1: columns.
+ * Instead, decode each FileColumnMetadataV3 individually as needed.
+ */
+
+ /** Total byte size of all the uncompressed column data in this row group **/
+ 1: required i64 total_byte_size
+
+ /** Number of rows in this row group **/
+ 2: required i64 num_rows
+
+ /** If set, specifies a sort ordering of the rows in this row group. */
+ 3: optional list<SortingColumn> sorting_columns
Review Comment:
> I think this is kind of the same rationale as the codecs (easily being
able to concatenate row groups with different sort orders)
Yes, but it can be worked around by simply dropping sort order metadata if
row groups have inconsistent values. This is different to codec.
##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +973,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}
+struct ColumnChunkV3 {
+ /** File where column data is stored. **/
+ 1: optional string file_path
+
+ /** Byte offset in file_path to the ColumnChunkMetaDataV3, optionally
encrypted
+ ** CHANGED from v1: renamed to metadata_file_offset
+ **/
+ 2: required i64 metadata_file_offset
+
+ /** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3,
optionally encrypted
+ **/
+ 3: required i32 metadata_file_length
Review Comment:
BTW, do we still want to put a copy `ColumnMetaData` at the end of column
chunk and another copy here at `4: optional binary encoded_metadata`? I know it
is good to keep backward compatibility. Does any implementation actually read
it from the end of column chunk?
##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+ /** All column chunks in this file (one per row group) **/
+ 1: required list<ColumnChunkV3> columns
Review Comment:
> One big problem with the current thrift structure is that we don't really
know the (variable) size of a column chunk
Same here. IIUC, the spec does not prohibit placing column chunks in random
order. This is something that we want to keep and would be good if we get the
size in a cheap way.
##########
src/main/thrift/parquet.thrift:
##########
@@ -914,6 +1040,32 @@ struct RowGroup {
7: optional i16 ordinal
}
+struct RowGroupV3 {
+ /** REMOVED from v1: columns.
+ * Instead, decode each FileColumnMetadataV3 individually as needed.
+ */
+
+ /** Total byte size of all the uncompressed column data in this row group **/
+ 1: required i64 total_byte_size
+
+ /** Number of rows in this row group **/
+ 2: required i64 num_rows
+
+ /** If set, specifies a sort ordering of the rows in this row group. */
+ 3: optional list<SortingColumn> sorting_columns
+
+ /** REMOVED from v1: file_offset.
+ * Use the OffsetIndex for each column instead.
Review Comment:
IIUC, the current spec does not prohibit placing column chunks in random
order. So we must use `file_offset` together with `total_compressed_size` to
determine the read range. This is a trick used to place small column chunks
together which may be fetched in a single I/O.
##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+ /** All column chunks in this file (one per row group) **/
+ 1: required list<ColumnChunkV3> columns
+
+ /** Sort order used for the Statistics min_value and max_value fields
+ **/
+ 2: optional ColumnOrder column_order;
+
+ /** NEW from v1: Optional key/value metadata for this column at the file
level
+ **/
+ 3: optional list<KeyValue> key_value_metadata
Review Comment:
Putting all user-defined metadata in a list is subject to limitations from
thrift. That's why we have to care about its size. Switching it to separate
"page" or something may unblock other potentials like putting extra
user-defined secondary index. For now, we can only choose a "black hole" from
somewhere in the file and put its offset/length pair into the
`key_value_metadata` if we want to add custom index.
##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+ /** All column chunks in this file (one per row group) **/
+ 1: required list<ColumnChunkV3> columns
+
+ /** Sort order used for the Statistics min_value and max_value fields
+ **/
+ 2: optional ColumnOrder column_order;
+
+ /** NEW from v1: Optional key/value metadata for this column at the file
level
+ **/
+ 3: optional list<KeyValue> key_value_metadata
+}
+
+struct FileMetaDataV3 {
+ /** Version of this file **/
+ 1: required i32 version
+
+ /** Parquet schema for this file **/
+ 2: required list<SchemaElementV3> schema;
+
+ /** Number of rows in this file **/
+ 3: required i64 num_rows
+
+ /** Row groups in this file **/
+ 4: required list<RowGroupV3> row_groups
+
+ /** Optional key/value metadata for this file. **/
+ 5: optional list<KeyValue> key_value_metadata
+
+ /** String for application that wrote this file. **/
+ 6: optional string created_by
+
+ /** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
+ 7: required list<i64> file_column_metadata_offset;
+ /** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
+ 8: required list<i32> file_column_metadata_length;
+
+ /** REMOVED from v1: column_orders.
+ ** Use `FileColumnMetadataV3.column_order` instead.
Review Comment:
+1
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]