mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303273552
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
Review Comment:
Hmm, first of all, the PageIndex might not be a "footer", because there is
some flexibility in where it is placed (each row group stores a
`(length, offset)` pair per column chunk for the column index and the offset
index).
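
As a rough illustration of that flexibility, here is a minimal Python sketch of how a reader could fetch a serialized index from the `(length, offset)` pair alone. `IndexLocation` and `read_index_bytes` are hypothetical names invented for this example, not part of any Parquet library, and the fake file layout is an assumption. Decoding the Thrift payload is not shown.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO


@dataclass(frozen=True)
class IndexLocation:
    """Hypothetical holder for the (offset, length) pair of one index."""
    offset: int  # absolute byte position in the file
    length: int  # size of the serialized index in bytes


def read_index_bytes(f: BinaryIO, loc: IndexLocation) -> bytes:
    """Return the raw serialized index bytes; Thrift decoding is out of scope."""
    f.seek(loc.offset)
    data = f.read(loc.length)
    if len(data) != loc.length:
        raise EOFError("index range extends past the end of the file")
    return data


# The index can sit anywhere in the file, not only right next to the footer.
fake_file = io.BytesIO(b"PAR1" + b"<pages>" + b"<column-index>" + b"<footer>")
loc = IndexLocation(offset=11, length=14)  # 4 magic bytes + 7 page bytes
raw = read_index_bytes(fake_file, loc)
```

Because the reader only follows the stored offset, a writer is free to place the index wherever it likes in the file.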
Estimating batch size is important. However, I think keeping page-level
statistics in the "index" or "footer" might be a bit odd (since we could
store them per page). If you want this, I think you can draft a new pull
request in this repo, and maybe put the statistics in the footer or the index.
I've searched the project:
1. `OffsetIndex` has a compressed size, but it is really there for IO.
2. `ColumnMetaData` has `encoding_stats`, but that is per encoding.
You are welcome to draft one here. We could even encode user-defined stats in
`key_value_metadata` as a Base64 or Base85 string.
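
A minimal sketch of that last idea, assuming the user-defined stat is a list of integer counts. The key name `example.def_level_histogram` and the fixed-width little-endian int64 layout are assumptions made for illustration, not an agreed convention, and a plain dict stands in for `key_value_metadata`.

```python
import base64
import struct
from typing import List


def encode_histogram(counts: List[int]) -> str:
    """Pack counts as little-endian int64 values and return Base64 text."""
    packed = struct.pack(f"<{len(counts)}q", *counts)
    return base64.b64encode(packed).decode("ascii")


def decode_histogram(text: str) -> List[int]:
    """Inverse of encode_histogram."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


# A plain dict stands in for key_value_metadata (string key -> string value).
key_value_metadata = {
    "example.def_level_histogram": encode_histogram([3, 0, 97]),
}
restored = decode_histogram(key_value_metadata["example.def_level_histogram"])
```

Readers that do not know the key simply ignore it, so this approach needs no format change, at the cost of the stats being opaque to generic tools.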