Re: [PR] DRAFT: Alternative Strawman proposal for a new V3 footer format in Parquet [parquet-format]

via GitHub Wed, 29 May 2024 09:42:43 -0700


emkornfield commented on code in PR #250:
URL: https://github.com/apache/parquet-format/pull/250#discussion_r1619192894



##########
README.md:
##########
@@ -118,6 +118,65 @@ chunks they are interested in.  The columns chunks should then be read sequentia
 
  ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+ ### PAR3 File Footers
+
+ The PAR3 file footer format is designed to better support wider schemas and more control
+ over the various footer size vs compute trade-offs.  Its format is as follows:
+   - Data pages containing serialized Thrift metadata objects that were modeled as lists
+     in PAR1. These are stored contiguously with offsets stored in the FileMetadata. See
+     parquet.thrift for more details on each.
+   - Serialized Thrift FileMetadata Structure
+   - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata.
+   - 4-byte length in bytes (little endian) of the serialized FileMetadata structure.
+   - 4-byte length in bytes (little endian) of all preceding elements in the footer.
+   - 1 byte flag field to indicate features that require special parsing of the footer.
+     Readers MUST raise an error if there is an unrecognized flag.  Current flags:
+
+     * 0x01 - Footer encryption enabled (when set the encryption information is written before
+        FileMeta structure as in the PAR1 footer).
+     * 0x02 - CRC32 of FileMetadata Footer.
+
+   - 4-byte magic number "PAR3"
+
+  When parsing the footer, implementations SHOULD read at least the last 10 bytes of the footer. Then
+  read in the entirety of the footer based on the length of all preceding elements. This prevents further
+  I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers.
+  If a file is written with only a PAR3 footer, implementations MUST write PAR3 as the first four bytes in
+  the file. PAR3 footers can also be written in a backwards compatible way after PAR1 Metadata
+  (see next section for details).
+
+  #### Dual Mode PAR1 and PAR3 footers
+
+  There is a desire to gradually roll out PAR3 footers to allow newer readers to take advantage of them, while
+  older readers can still properly parse the file. This section outlines a strategy to do this.
+
+  As background, Thrift structs are always serialized with a 0 trailing byte to delimit their ending.
+  Therefore, for PAR1 files written before PAR3 was introduced, we always expect the files to have the following
+  trailing 9 bytes [0x00, x, x, x, x, P, A, R, 1] (where x can be any value).  We also expect all compliant
+  Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once read.
+  Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata"

Review Comment:
   I'll revise this and move the design considerations elsewhere.
   
   @alkis if this was the approach you sketched on
https://github.com/apache/parquet-format/pull/242/files#r1607838732, then since I'm
revising this anyway, I can incorporate it into the PR so we can consolidate
discussion.



