Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Tue, 21 May 2024 13:59:49 -0700


emkornfield commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1608945075



##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
   9: optional binary footer_signing_key_metadata
 }
 
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+  /** All column chunks in this file (one per row group) **/
+  1: required list<ColumnChunkV3> columns

Review Comment:
   > > Instead of being an offset, I suppose this could just be modeled in the
   > > message as [bytes]
   > 
   > Sure, but what would that change exactly? You'll have to decode it anyway.
   
   If the question is specifically about bytes in Thrift vs. a page stored 
someplace else, I think the main trade-off is how easy it is to achieve zero 
copy on the underlying bytes that need to be decoded vs. the complexity of 
handling the offset.  Decoding just one or a few large byte arrays does not 
have to be expensive, but in practice, at least in C++, it is more expensive 
than it probably should be.  It seems Java might actually have the ability to 
do zero copy here.
   
   If the question is why a data page (which still needs decoding), I was 
thinking that if we introduced a new encoding which effectively uses the Arrow 
byte array layout `([cumulative_offsets_into_byte_data], [byte_data])`, then 
decoding is really just a matter of parsing the page header, and the page 
would provide random access to elements that could be decoded individually.  
For end users more concerned about overall space, the page abstraction still 
allows for more complex encodings/compression to reduce footer size.
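   
   To make the layout idea concrete, here is a minimal sketch in Python of 
the `([cumulative_offsets_into_byte_data], [byte_data])` arrangement.  The 
framing (a uint32 count followed by little-endian int32 offsets) and the names 
`encode_byte_arrays` and `get_element` are assumptions made for illustration 
only; they are not part of the proposal or of any Parquet or Arrow API.

```python
import struct


def encode_byte_arrays(values):
    """Pack values as (cumulative offsets, concatenated bytes).

    Hypothetical framing, for illustration only:
    uint32 count, then count + 1 little-endian int32 offsets, then the data.
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    header = struct.pack("<I", len(values))
    offset_bytes = struct.pack(f"<{len(offsets)}i", *offsets)
    return header + offset_bytes + b"".join(values)


def get_element(page, i):
    """Return element i as a memoryview slice, without touching other elements."""
    view = memoryview(page)
    (count,) = struct.unpack_from("<I", view, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    # Read only the two offsets that bound element i.
    start, end = struct.unpack_from("<2i", view, 4 + 4 * i)
    data_start = 4 + 4 * (count + 1)
    # Slicing a memoryview does not copy the underlying bytes.
    return view[data_start + start : data_start + end]


page = encode_byte_arrays([b"col_a_meta", b"", b"col_c_metadata"])
print(bytes(get_element(page, 2)))
```

   Looking up one element reads two offsets and slices the buffer, so the 
cost does not depend on how many other elements the page holds.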
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


