emkornfield commented on code in PR #250: URL: https://github.com/apache/parquet-format/pull/250#discussion_r1619192894
##########
README.md:
##########
@@ -118,6 +118,65 @@ chunks they are interested in. The columns chunks should then be read sequentia

+ ### PAR3 File Footers
+
+ The PAR3 file footer format is designed to better support wider schemas and to give more control
+ over the various footer size vs compute trade-offs. Its format is as follows:
+ - Data pages containing serialized Thrift metadata objects that were modeled as lists
+   in PAR1. These are stored contiguously, with offsets stored in the FileMetadata. See
+   parquet.thrift for more details on each.
+ - Serialized Thrift FileMetadata structure.
+ - (Optional) 4-byte CRC32 of the serialized Thrift FileMetadata.
+ - 4-byte length in bytes (little endian) of the serialized FileMetadata structure.
+ - 4-byte length in bytes (little endian) of all preceding elements in the footer.
+ - 1-byte flag field to indicate features that require special parsing of the footer.
+   Readers MUST raise an error if there is an unrecognized flag. Current flags:
+
+   * 0x01 - Footer encryption enabled (when set, the encryption information is written before
+     the FileMetadata structure, as in the PAR1 footer).
+   * 0x02 - CRC32 of FileMetadata Footer.
+
+ - 4-byte magic number "PAR3"
+
+ When parsing the footer, implementations SHOULD read at least the last 10 bytes of the footer, then
+ read in the entirety of the footer based on the length of all preceding elements. This prevents further
+ I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers.
+ If a file is written with only a PAR3 footer, implementations MUST write PAR3 as the first four bytes in
+ the file. PAR3 footers can also be written in a backwards-compatible way after PAR1 metadata
+ (see next section for details).
+
+ #### Dual Mode PAR1 and PAR3 footers
+
+ There is a desire to gradually roll out PAR3 footers to allow newer readers to take advantage of them, while
+ older readers can still properly parse the file.
+ This section outlines a strategy to do this.
+
+ As background, Thrift structs are always serialized with a trailing 0 byte to delimit their end.
+ Therefore, for PAR1 files written before PAR3 was introduced, we always expect the files to have the following
+ trailing 9 bytes: [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). We also expect all compliant
+ Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once read.
+ Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata"

Review Comment:
I'll revise this and move the design considerations elsewhere. @alkis, if this was the approach you sketched in https://github.com/apache/parquet-format/pull/242/files#r1607838732, then since I'm revising this anyway, I can incorporate it into the PR so we can consolidate discussion.
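
For readers following the thread, the trailer layout proposed in the hunk above can be sketched in Python as below. This is only an illustration under stated assumptions, not part of the PR: the names `parse_par3_trailer`, `Par3Trailer`, and `looks_like_par1_tail` are hypothetical, the flag byte is assumed to be a bit mask, and the sketch decodes the last 13 bytes (both length fields, the flag byte, and the magic), which is a choice of this sketch rather than something the proposal mandates. It does not verify the CRC or interpret the "preceding elements" length beyond returning it.

```python
import struct
from typing import NamedTuple

PAR3_MAGIC = b"PAR3"
FLAG_ENCRYPTED = 0x01  # footer encryption enabled
FLAG_CRC32 = 0x02      # CRC32 of FileMetadata present
KNOWN_FLAGS = FLAG_ENCRYPTED | FLAG_CRC32  # assumption: flags form a bit mask

# 4-byte metadata length + 4-byte preceding length + 1-byte flags + 4-byte magic
TRAILER_SIZE = 13


class Par3Trailer(NamedTuple):
    """Hypothetical container for the fixed-size fields at the end of the file."""
    metadata_len: int   # length of the serialized FileMetadata
    preceding_len: int  # length of all preceding footer elements
    flags: int


def parse_par3_trailer(tail: bytes) -> Par3Trailer:
    """Decode the fixed-size PAR3 trailer from the last bytes of a file."""
    if len(tail) < TRAILER_SIZE:
        raise ValueError("need at least 13 trailing bytes")
    # "<" selects little-endian byte order with no padding.
    metadata_len, preceding_len, flags, magic = struct.unpack(
        "<IIB4s", tail[-TRAILER_SIZE:]
    )
    if magic != PAR3_MAGIC:
        raise ValueError("missing PAR3 magic")
    # The proposal says readers MUST raise an error on unrecognized flags.
    if flags & ~KNOWN_FLAGS:
        raise ValueError(f"unrecognized footer flags: {flags:#04x}")
    return Par3Trailer(metadata_len, preceding_len, flags)


def looks_like_par1_tail(tail: bytes) -> bool:
    """Check the expected PAR1 ending: [0x00, x, x, x, x, P, A, R, 1]."""
    return len(tail) >= 9 and tail[-9] == 0x00 and tail[-4:] == b"PAR1"
```

A reader could call `parse_par3_trailer` on the final bytes of a file and then issue one further read covering `preceding_len` bytes, which is the single-extra-read behaviour the proposed text describes.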
