emkornfield commented on code in PR #242: URL: https://github.com/apache/parquet-format/pull/242#discussion_r1603954347
##########
README.md:
##########
@@ -107,12 +113,97 @@ start locations. More details on what is contained in the metadata can be found
 in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
+This is especially useful when writing to backends such as S3.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read sequentially.
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>
+...
+<File-level Column N Metadata v3>
+
+File Metadata v3
+4-byte length in bytes of File Metadata v3 (little endian)

Review Comment:
   1. Likely a very small fraction. A lighter-weight digest is also fine; we have digests in other parts of the spec, and I think the main reason for not having one on the footer was likely to avoid compatibility issues.
   2. By "Full Footer" I mean FileMetadata plus serialized metadata such as indexes/bloom filters and anything we move to the data page, all placed after the column chunks.
   3. It isn't clear to me that everyone will check the header. Doing so adds an additional IO for not much benefit, unless the entire file is being retrieved from disk anyway.
   4. Compressing the Thrift-serialized data to minimize size, if consumers want the smallest possible file.
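To make point 3 concrete, here is a minimal sketch of a tail-first reader. It assumes the trailer shape of the existing Parquet layout (serialized footer, then a 4-byte little-endian length, then a 4-byte magic); the trailing magic for Parquet 3 is not visible in the quoted hunk, so it is left as a parameter. `CountingReader`, `read_footer` and `build_toy_file` are hypothetical names for illustration only, not part of any Parquet library, and each positioned read stands in for one IO request.

```python
import io
import struct

MAGIC_LEN = 4
TRAILER_LEN = 4 + MAGIC_LEN  # 4-byte little-endian footer length + 4-byte magic


class CountingReader:
    """Wraps a seekable binary stream and counts positioned reads."""

    def __init__(self, raw):
        self.raw = raw
        self.reads = 0

    def read_at(self, offset, length):
        self.reads += 1
        self.raw.seek(offset)
        return self.raw.read(length)


def read_footer(reader, file_size, tail_magic=b"PAR1", head_magic=None):
    """Return the serialized footer bytes.

    If head_magic is given, the leading magic is validated as well,
    which costs one more read at the opposite end of the file.
    """
    trailer = reader.read_at(file_size - TRAILER_LEN, TRAILER_LEN)
    (footer_len,) = struct.unpack("<I", trailer[:4])
    if trailer[4:] != tail_magic:
        raise ValueError("unexpected trailing magic")
    footer = reader.read_at(file_size - TRAILER_LEN - footer_len, footer_len)
    if head_magic is not None:
        if reader.read_at(0, MAGIC_LEN) != head_magic:
            raise ValueError("unexpected leading magic")
    return footer


def build_toy_file(footer, magic=b"PAR1", body=b"\x00" * 64):
    """Assemble a toy file: magic, opaque body, footer, footer length, magic."""
    return magic + body + footer + struct.pack("<I", len(footer)) + magic


data = build_toy_file(b"fake-thrift-footer")

tail_only = CountingReader(io.BytesIO(data))
footer = read_footer(tail_only, len(data))

with_header = CountingReader(io.BytesIO(data))
read_footer(with_header, len(data), head_magic=b"PAR1")

print(tail_only.reads, with_header.reads)
```

In this toy setup the tail-only path issues two reads and the header check adds a third. Against an object store each of those would be a separate range request, which is the cost being weighed above.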
