Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 16 May 2024 05:41:25 -0700


mapleFU commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1603272729



##########
src/main/thrift/parquet.thrift:
##########
@@ -835,6 +864,63 @@ struct ColumnMetaData {
   16: optional SizeStatistics size_statistics;
 }
 
+struct ColumnChunkMetaDataV3 {
+  /** REMOVED from v1: type (redundant with SchemaElementV3) */
+  /** REMOVED from v1: encodings (unnecessary and non-trivial to get right) */
+  /** REMOVED from v1: path_in_schema (unnecessary and wasteful) */
+  /** REMOVED from v1: index_page_offset (unused in practice?) */
+  /** REMOVED from v1: statistics (use ColumnIndex and/or page-level statistics instead) */

Review Comment:
   Could row-group level statistics be kept, as binary? I think that:
   1. Currently pyarrow lacks PageIndex filtering, so writers have to keep 
row groups "small" and readers rely on row-group level filtering
   2. With PageIndex we can avoid writing "small" row groups. But keeping 
row-group statistics may still help us quickly filter out a whole row group
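
   To illustrate the second point, here is a minimal sketch of the idea, not tied to any real reader. `RowGroupStats`, `page_ranges`, and `plan_scan` are hypothetical names invented for this example, and plain integer min/max pairs stand in for the row-group statistics and the ColumnIndex. A row group is rejected with a single comparison before any page-level ranges are inspected:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's row-group level min/max statistics."""
    min_value: int
    max_value: int
    # Hypothetical per-page (min, max) pairs, standing in for a ColumnIndex.
    page_ranges: List[Tuple[int, int]]


def may_contain(lo: int, hi: int, value: int) -> bool:
    """True if the closed range [lo, hi] cannot rule out `value`."""
    return lo <= value <= hi


def plan_scan(row_groups: List[RowGroupStats], value: int) -> Tuple[List[int], int]:
    """Return the indices of row groups to read and the number of page ranges inspected."""
    selected = []
    pages_inspected = 0
    for i, rg in enumerate(row_groups):
        # Cheap check first: one comparison against the row-group statistics.
        if not may_contain(rg.min_value, rg.max_value, value):
            continue
        # Only surviving row groups pay for the finer-grained page-level check.
        pages_inspected += len(rg.page_ranges)
        if any(may_contain(lo, hi, value) for lo, hi in rg.page_ranges):
            selected.append(i)
    return selected, pages_inspected


row_groups = [
    RowGroupStats(0, 99, [(0, 49), (50, 99)]),
    RowGroupStats(100, 199, [(100, 149), (150, 199)]),
    RowGroupStats(200, 299, [(200, 249), (250, 299)]),
]
selected, pages_inspected = plan_scan(row_groups, 120)
```

   In this toy setup only the second row group survives, and the page ranges of the other two are never touched. Without row-group statistics, the reader would have to consult the page-level ranges of every row group to reach the same answer.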



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


