Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 06 Jun 2024 04:46:51 -0700


AudriusButkevicius commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1629368632



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   The _metadata file has all of the row group details and stats. If I have a 
dataset that I need to read and filter to a given date (where date is not part 
of the filesystem partitioning scheme), I can use the _metadata file to narrow 
down the row groups, see which files those row groups belong to, and read only 
those files. Otherwise I'd have to open every file, read its row group stats, 
decide the file doesn't have the date I'm after, close the file, move on to the 
next file, and repeat.
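
A rough sketch of the lookup described above, in Python. It does not use any 
Parquet library: `RowGroupStats` and `files_for_date` are hypothetical names, 
and the list of row groups stands in for whatever a reader would parse out of 
the _metadata footer (the `file_path` of each column chunk plus the min/max 
statistics of the date column). The file names and dates are made up for 
illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group entry parsed from _metadata."""
    file_path: str   # data file holding this row group (from ColumnChunk.file_path)
    min_date: date   # min statistic of the date column in this row group
    max_date: date   # max statistic of the date column in this row group


def files_for_date(row_groups, target):
    """Return the data files that have at least one row group whose
    [min, max] range for the date column could contain `target`."""
    return sorted({
        rg.file_path
        for rg in row_groups
        if rg.min_date <= target <= rg.max_date
    })


# Illustrative contents of a _metadata file covering three data files.
row_groups = [
    RowGroupStats("part-0.parquet", date(2024, 6, 1), date(2024, 6, 3)),
    RowGroupStats("part-0.parquet", date(2024, 6, 4), date(2024, 6, 5)),
    RowGroupStats("part-1.parquet", date(2024, 6, 5), date(2024, 6, 7)),
    RowGroupStats("part-2.parquet", date(2024, 6, 8), date(2024, 6, 9)),
]

# Only these files need to be opened; part-2.parquet is skipped entirely.
selected = files_for_date(row_groups, date(2024, 6, 5))
print(selected)
```

The point is that the pruning decision is made from one metadata read, and it 
only works if each entry records which file its row group lives in.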



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

