emkornfield commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1614017588
##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+ /** All column chunks in this file (one per row group) **/
+ 1: required list<ColumnChunkV3> columns
Review Comment:
I think assumptions about I/Os depend on the exact layout of the footer. If it
is helpful I can write up a complete PR for my suggestions. But I was assuming a
PAR3 footer would look like:
`<Metadata page1><Metadata page2><Metadata page3><FileMetadata Thrift
footer><offset of metadata page1 from end of file (we can determine if this is
a requirement or a recommendation)><offset of FileMetadata thrift footer from end of
file><crc of thrift metadata (or of all pages + thrift footer, but pages
already have coverage at least for their data component)><1 byte feature
bitmap>PAR3`
Given this footer, the desired algorithm is to do one I/O from the `offset of
metadata page1` to the end of the file if the initial speculative footer read
(e.g. 48KB in some implementations) did not retrieve it all. After that, none
of the references for Option 1 should incur I/O.
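To make the I/O count concrete, here is a minimal Python sketch of that read path. The field widths are my assumptions and nothing here is settled: 4-byte little-endian offsets, a 4-byte CRC, a 1-byte feature bitmap, then the `PAR3` magic. `CountingFile` and `read_footer` are hypothetical names used only for illustration, and CRC verification is left out because its coverage is still open.

```python
import io
import struct

# Assumed trailer encoding (not specified anywhere yet): two 4-byte
# little-endian offsets, a 4-byte CRC, a 1-byte feature bitmap, 4-byte magic.
TRAILER = struct.Struct("<IIIB4s")
MAGIC = b"PAR3"


class CountingFile:
    """Wraps a seekable binary file object and counts tail reads (one per I/O)."""

    def __init__(self, raw):
        self.raw = raw
        self.reads = 0
        self.size = raw.seek(0, io.SEEK_END)

    def read_tail(self, nbytes):
        nbytes = min(nbytes, self.size)
        self.raw.seek(self.size - nbytes)
        self.reads += 1
        return self.raw.read(nbytes)


def read_footer(f, speculative_size=48 * 1024):
    """Return (metadata pages, thrift footer, feature bitmap) in at most two I/Os."""
    tail = f.read_tail(speculative_size)
    if len(tail) < TRAILER.size:
        raise ValueError("file too small to hold a PAR3 trailer")
    page1_off, thrift_off, _crc, features, magic = TRAILER.unpack(
        tail[-TRAILER.size:]
    )
    if magic != MAGIC:
        raise ValueError("not a PAR3 footer")
    if not (TRAILER.size <= thrift_off <= page1_off <= f.size):
        raise ValueError("inconsistent footer offsets")
    if page1_off > len(tail):
        # The speculative read fell short: one more I/O, from the start of
        # metadata page 1 to the end of the file.
        tail = f.read_tail(page1_off)
    # memoryview slices reference the buffer instead of copying it.
    view = memoryview(tail)
    n = len(tail)
    pages = view[n - page1_off : n - thrift_off]
    thrift = view[n - thrift_off : n - TRAILER.size]
    return pages, thrift, features
```

With a file whose metadata fits inside the speculative read, `f.reads` stays at 1; otherwise it is exactly 2, and everything afterwards is served from the in-memory buffer.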
I agree, mem-copies aren't necessarily at the top of the list now for
performance, but if they can be avoided without undue complexity, why not?
(Undue complexity, I guess, is in the eye of the beholder.)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]