Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 06 Jun 2024 12:52:37 -0700


steveloughran commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1630144138



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   > And the benefits of using this _metadata index file should translate to 
cloud object stores too by reducing the number of objects/files to be read. 
   
   not really:
   1. it forces a full read of all generated files in job commit, which, even 
if done in parallel, is really slow. If it were to be done, it would be better 
done on demand in the first query. (note: faster reads would improve this)
   2. it doesn't work with the cloud committer design, which #1361 formalises, 
without some bridging classes. 
   
   the reason for (2) is that the Hadoop cloud-native committer design kept 
clear of making any changes to the superclass of `ParquetOutputCommitter`, as 
it is a critical piece of code in so many existing workflows and really hard to 
understand: not just a co-recursive algorithm, but two intermingled algorithms, 
one of which lacks the correctness guarantee that failures during task commit 
can be recovered from.
   
   with a move to table-based formats rather than directory trees, that whole 
commit process becomes much easier, and it also supports atomic job commits on 
a table (including deletes!). And as you note, these formats can include schema 
info too.
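
To illustrate point 1, here is a minimal sketch of the difference between building a summary of every file's footer at job commit and deferring that work to the first query. It is a toy model only: `build_summary_eagerly`, `LazySummary` and `fake_read_footer` are hypothetical names, the footer is a stand-in dict, and none of this is the Parquet or Hadoop committer API.

```python
from concurrent.futures import ThreadPoolExecutor

# Records every simulated footer read so the cost of each strategy is visible.
reads = []


def fake_read_footer(path):
    """Hypothetical stand-in for opening a file and parsing its footer."""
    reads.append(path)
    return {"num_rows": 10}


def build_summary_eagerly(paths, read_footer, max_workers=8):
    """Read the footer of every file (in parallel) and collect the results.

    Doing this in job commit means the commit cannot finish until every
    generated file has been read once.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        footers = list(pool.map(read_footer, paths))
    return dict(zip(paths, footers))


class LazySummary:
    """Defer the footer reads until something actually asks for the summary."""

    def __init__(self, paths, read_footer):
        self._paths = list(paths)
        self._read_footer = read_footer
        self._cache = None

    def get(self):
        # The first caller pays for the reads; later callers reuse the cache.
        if self._cache is None:
            self._cache = build_summary_eagerly(self._paths, self._read_footer)
        return self._cache
```

In this model, constructing `LazySummary` performs no reads, so job commit stays cheap; the same number of footer reads still happens, but only once a query calls `get()`.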



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
