mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303273552


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Hmm, first of all, the PageIndex might not be a "footer", because there is
   some flexibility in where it is placed (each row group has a
   `(length, offset)` pair for the column index and the offset index).
   
   Estimating the batch size is important, but I wonder whether putting
   page-level statistics in the "index" or "footer" might be a bit weird
   (because we might have them per page). If you want this, I think you can
   draft a new pull request in this repo, and maybe put the statistics in the
   footer or the index.
   
   I've searched in the project:
   
   1. `OffsetIndex` has a compressed size (`compressed_page_size` in each `PageLocation`), but that is really there for IO.
   2. `ColumnMetaData` has `encoding_stats`, but those are per encoding.
   
   I think neither 1 nor 2 is a perfect fit here. As for user-defined extended
   info, we can even encode user-defined stats in `key_value_metadata` as a
   base64 or base85 string.
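
   As a rough sketch of that last idea (not a proposal for the spec): per-page
   numbers could be packed into bytes and stored as a base64 string under a
   custom key. The key name, the little-endian int64 layout, and the plain dict
   standing in for `key_value_metadata` are all assumptions made for
   illustration only.

```python
import base64
import struct

# Hypothetical key name; it is not defined anywhere in the Parquet format.
STATS_KEY = "example.page_uncompressed_sizes"


def encode_page_stats(sizes):
    """Pack per-page sizes as little-endian int64 values, then base64-encode."""
    raw = struct.pack(f"<{len(sizes)}q", *sizes)
    return base64.b64encode(raw).decode("ascii")


def decode_page_stats(text):
    """Reverse encode_page_stats: base64-decode, then unpack the int64 values."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


# A plain dict stands in for the string-to-string key_value_metadata entries.
key_value_metadata = {STATS_KEY: encode_page_stats([1024, 2048, 512])}
page_sizes = decode_page_stats(key_value_metadata[STATS_KEY])
```

   A reader that does not know the key would simply ignore it, which is the
   main appeal of this route, but the values would not be visible to generic
   tooling.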
   
   You're welcome to draft a pull request in this project.



