Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Fri, 24 May 2024 14:01:42 -0700


emkornfield commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1614017588



##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
   9: optional binary footer_signing_key_metadata
 }
 
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+  /** All column chunks in this file (one per row group) **/
+  1: required list<ColumnChunkV3> columns

Review Comment:
   I think the assumptions about I/O depend on the exact layout of the footer.  If it is 
helpful I can write up a complete PR for my suggestions.  But I was assuming a 
PAR3 footer would look like:
   
   `<Metadata page1><Metadata page2><Metadata page3><FileMetadata Thrift 
footer><offset of metadata page1 from end of file (we can determine if this is 
required or a recommendation)><offset of FileMetadata thrift footer from end of 
file><crc of thrift metadata (or of all pages + thrift footer, but pages 
already have coverage at least for their data component)><1 byte feature 
bitmap>PAR3`
   
   Given this footer, the desired algorithm is to do 1 IO from the `offset of 
metadata page1` to the end of the file if the initial speculative 
footer read (e.g. 48KB in some implementations) did not retrieve it all.  After 
that, any of the references for Option 1 should not incur I/O.
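   
   A minimal sketch of that read path in Python. The field widths (4-byte 
little-endian offsets and CRC) are my assumption and are not settled in this 
thread, and `read_footer` is a hypothetical helper; CRC verification and 
Thrift decoding are omitted:

```python
import struct

# Assumed trailer layout (widths are illustrative, not part of the proposal):
#   u32 offset of metadata page1 from end of file
#   u32 offset of FileMetadata thrift footer from end of file
#   u32 crc
#   u8  feature bitmap
#   4-byte magic b"PAR3"
TRAILER = struct.Struct("<IIIB4s")


def read_footer(f, file_size, speculative_size=48 * 1024):
    """Return (footer_bytes, thrift_start, io_count).

    footer_bytes spans from metadata page1 to the end of the file, and
    thrift_start is the index of the thrift footer inside footer_bytes.
    """
    io_count = 0

    # Speculative read of the tail of the file.
    want = min(speculative_size, file_size)
    f.seek(file_size - want)
    tail = f.read(want)
    io_count += 1

    page1_off, thrift_off, _crc, _features, magic = TRAILER.unpack(
        tail[-TRAILER.size:]
    )
    if magic != b"PAR3":
        raise ValueError("not a PAR3 footer")

    if page1_off > len(tail):
        # The speculative read missed part of the metadata pages:
        # do exactly one more IO, from page1 to the end of the file.
        f.seek(file_size - page1_off)
        tail = f.read(page1_off)
        io_count += 1

    footer = tail[len(tail) - page1_off:]
    return footer, page1_off - thrift_off, io_count
```

   If the speculative read already covers `<Metadata page1>`, this returns 
after one IO; otherwise it issues one more read, and every later lookup can be 
served from the returned buffer.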
   
   I agree, mem-copies aren't necessarily at the top of the list now for 
performance, but if they can be avoided without undue complexity, why not? (undue 
complexity I guess is in the eye of the beholder).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

