pitrou commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1613050965


##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
   9: optional binary footer_signing_key_metadata
 }
 
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+  /** All column chunks in this file (one per row group) **/
+  1: required list<ColumnChunkV3> columns

Review Comment:
   > I think this leaves us at the mercy of thrift parsers with respect to 
additional copies
   
   This is rather strong wording :-) Memory copies are not the bane of 
performance, assuming they remain small-ish.
   
   > which might not be huge but seems to be something we can avoid if we move 
the list to a "page"
   
   At the cost of at least one additional indirect IO for each column chunk, 
which might not always be desirable?
   
   ---
   
   I'm trying to find a good tradeoff here. I think we have three reasonable 
options:
   1. Use file offsets for both row groups and column chunks.
   2. Use file offsets for row groups, and embedded Thrift-encoded binary for 
column chunks.
   3. Use embedded Thrift-encoded binary for row groups, and file offsets for 
column chunks.
   
   Loading the file metadata should be fast with options 1 and 2, less so with 
option 3.
   
   Accessing the metadata for M columns in N row groups requires _O(M * N)_ IOs 
in option 1, _O(N)_ IOs in option 2, and _O(M)_ IOs in option 3. Since N is 
typically much smaller than M (a row group is typically very large, so you 
don't need to access many at once), it seems that option 2 is better than 
option 3 here, though both are much better than option 1.
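
To make the comparison concrete, here is a minimal sketch of the IO counts under the simple model above. The function name `metadata_ios` is hypothetical, and it returns only the leading term of each bound (constant factors and the initial footer read are ignored), so treat the numbers as orders of magnitude rather than measurements.

```python
def metadata_ios(option: int, m_columns: int, n_row_groups: int) -> int:
    """Leading-term IO count for reading column chunk metadata.

    Hypothetical model of the three options discussed above:
      1. file offsets for both row groups and column chunks
      2. file offsets for row groups, embedded column chunks
      3. embedded row groups, file offsets for column chunks
    """
    if option == 1:
        # One indirect IO per (column, row group) pair.
        return m_columns * n_row_groups
    if option == 2:
        # One indirect IO per row group; column chunks come along with it.
        return n_row_groups
    if option == 3:
        # One indirect IO per column.
        return m_columns
    raise ValueError(f"unknown option: {option}")


if __name__ == "__main__":
    # Illustrative values only: 100 columns read from 3 row groups.
    for option in (1, 2, 3):
        print(option, metadata_ios(option, m_columns=100, n_row_groups=3))
```

With N much smaller than M, option 2 issues the fewest IOs in this model, which is the basis for preferring it over option 3.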
   
   _However_, some readers might be able to coalesce IOs together if they are 
contiguous enough. In this case, the difference between M and N would matter 
much less and options 2 and 3 could end up issuing a similar number of IOs to 
access column chunk metadata.
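
As an illustration of what coalescing could look like, the sketch below merges byte ranges whose gaps fall under a threshold. The name `coalesce_ranges` and the `max_gap` parameter are assumptions made for this example, not the API of any particular Parquet reader.

```python
from typing import List, Tuple


def coalesce_ranges(
    ranges: List[Tuple[int, int]], max_gap: int
) -> List[Tuple[int, int]]:
    """Merge (offset, length) byte ranges separated by at most max_gap bytes.

    Each merged range stands for a single IO that covers several of the
    originally requested ranges, plus any small holes between them.
    """
    merged: List[List[int]] = []  # each entry is [start, end)
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


if __name__ == "__main__":
    # Two nearby metadata blobs and one far away (illustrative offsets).
    requests = [(0, 100), (110, 50), (10_000, 20)]
    print(coalesce_ranges(requests, max_gap=64))
```

If the metadata blobs for the requested columns or row groups sit close together in the file, the number of IOs after merging depends on layout rather than on M or N directly.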
   
   _Caution_: this lacks a discussion of IO size. If column access patterns are 
typically sparse, option 2 might lose its attractiveness.
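
A rough way to see the IO size concern, under assumptions that go beyond what is stated above: suppose every column chunk's metadata has the same encoded size, that each option 2 read fetches the metadata of every column in a row group, and that each option 3 read fetches one column's metadata across every row group. The function `metadata_bytes_read` and all of its parameters are hypothetical.

```python
def metadata_bytes_read(
    option: int,
    wanted_columns: int,
    wanted_row_groups: int,
    total_columns: int,
    total_row_groups: int,
    chunk_meta_bytes: int,
) -> int:
    """Bytes of column chunk metadata fetched, under the assumptions above."""
    if option == 2:
        # Each row group read pulls in metadata for all columns (assumption).
        return wanted_row_groups * total_columns * chunk_meta_bytes
    if option == 3:
        # Each column read pulls in metadata for all row groups (assumption).
        return wanted_columns * total_row_groups * chunk_meta_bytes
    raise ValueError(f"unsupported option: {option}")


if __name__ == "__main__":
    # Sparse access: 10 of 10,000 columns, 2 of 50 row groups (made-up sizes).
    for option in (2, 3):
        print(option, metadata_bytes_read(option, 10, 2, 10_000, 50, 100))
```

In this made-up sparse case option 2 issues fewer IOs but fetches far more bytes than option 3, which is the tradeoff the caution points at.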
   
   I would therefore intuitively tend towards option 2, but that's not backed 
by numbers.
   
   Did I get something wrong in these speculations?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]