emkornfield commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1608945075
##########
src/main/thrift/parquet.thrift:
##########
@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}
+/** Metadata for a column in this file. */
+struct FileColumnMetadataV3 {
+ /** All column chunks in this file (one per row group) **/
+ 1: required list<ColumnChunkV3> columns
Review Comment:
> > Instead of being an offset, I suppose this could just be modeled in the message as [bytes]
>
> Sure, but what would that change exactly? You'll have to decode it anyway.
If the question is specifically about bytes in Thrift versus a page stored
somewhere else, I think the main trade-off is how easy it is to achieve zero
copy on the underlying bytes that need to be decoded versus the complexity of
handling the offset. Decoding just one or a few large byte arrays does not
have to be expensive, but in practice, at least in C++, it is more expensive
than it probably should be. It seems Java might actually be able to do
zero copy here.
If the question is why a data page (which still needs decoding), I was
thinking that if we introduced a new encoding which effectively uses the Arrow
byte array layout `([cumulative_offsets_into_byte_data], [byte_data])`, then
decoding is really just the cost of parsing the page header, and the page would
provide random access to elements that could be decoded individually. For end
users more concerned about overall space, the page abstraction still allows for
more complex encodings/compression to reduce footer size.
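
To make the layout concrete, here is a rough sketch in Python of what I have in mind. It is only an illustration, not a proposed implementation: the function names `encode_byte_arrays` and `get_element` are hypothetical, and the choice of little-endian int32 offsets is an assumption made for the example rather than anything defined in this PR.

```python
import struct


def encode_byte_arrays(values):
    """Pack byte strings into (offsets_buf, data_buf).

    offsets_buf holds n + 1 cumulative offsets into data_buf
    (little-endian int32 is an assumption for this sketch).
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    data_buf = b"".join(values)
    return offsets_buf, data_buf


def get_element(offsets_buf, data_buf, i):
    """Return element i as a view, without touching other elements."""
    # Read the two adjacent offsets that bound element i.
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    # Slicing a memoryview does not copy the underlying bytes.
    return memoryview(data_buf)[start:end]
```

With this shape a reader that only needs the metadata for one column reads two offsets and slices the data buffer, so the per-element work does not depend on how many columns the file has.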
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]