gszadovszky commented on code in PR #242: URL: https://github.com/apache/parquet-format/pull/242#discussion_r1603224854
########## README.md: ########## @@ -113,6 +119,55 @@ chunks they are interested in. The columns chunks should then be read sequentia  +### Parquet 3 + +Parquet 3 files have the following overall structure: + +``` +4-byte magic number "PAR1" +4-byte magic number "PAR3" +8-byte offset of File Metadata v3 +8-byte length of File Metadata v3 Review Comment: Doesn't it mean we need to keep all row groups in memory until we start closing the file to fill these values? That's why we originally had the footer length at the end of the file. ########## src/main/thrift/parquet.thrift: ########## @@ -885,6 +971,44 @@ struct ColumnChunk { 9: optional binary encrypted_column_metadata } +struct ColumnChunkV3 { + /** File where column data is stored. **/ + 1: optional string file_path Review Comment: Do we want to keep this concept of having the metadata in a separate file? I did not see it working anywhere. ########## README.md: ########## @@ -113,6 +119,55 @@ chunks they are interested in. The columns chunks should then be read sequentia  +### Parquet 3 + +Parquet 3 files have the following overall structure: + +``` +4-byte magic number "PAR1" +4-byte magic number "PAR3" +8-byte offset of File Metadata v3 +8-byte length of File Metadata v3 Review Comment: Wondering if there might be any systems/tools that takes advantage of the first page starts just after the magic number. Probably not a serious concern, though. ########## README.md: ########## @@ -113,6 +119,55 @@ chunks they are interested in. The columns chunks should then be read sequentia  +### Parquet 3 + +Parquet 3 files have the following overall structure: + +``` +4-byte magic number "PAR1" +4-byte magic number "PAR3" +8-byte offset of File Metadata v3 +8-byte length of File Metadata v3 Review Comment: An alternative way would be storing these between `File Matadata v3` and `File Metadata` ########## src/main/thrift/parquet.thrift: ########## @@ -835,6 +864,63 @@ struct ColumnMetaData { 16: optional SizeStatistics size_statistics; } +struct ColumnChunkMetaDataV3 { + /** REMOVED from v1: type (redundant with SchemaElementV3) */ + /** REMOVED from v1: encodings (unnecessary and non-trivial to get right) */ + /** REMOVED from v1: path_in_schema (unnecessary and wasteful) */ + /** REMOVED from v1: index_page_offset (unused in practice?) */ + /** REMOVED from v1: statistics (use ColumnIndex and/or page-level statistics instead) */ Review Comment: So we cannot filter on row groups easily/quickly anymore. Would be nice to see some benchmarks what we loose here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
