etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303299069
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms;
Review Comment:
> Hmm first of all, PageIndex might not a "footer", because it has some
flexibility for puting it.( each rowgroup has a `(length, offset)` pair for
column and offset index)
Yes, I'm using "footer" liberally. I just mean not inline with the page
data when I say "footer".
> I think the 1-2 are both not perfect suitable here. And as-for user
defined extended info, we can even encode the user-defined stats in
key_value_metadata as base64 or base86 string
This is exactly what I was doing in
https://github.com/rapidsai/cudf/pull/12974. I put the (location,size) of the
unencoded size info struct in key/value pairs, and then wrote a per-column
chunk struct before the column/offset indexes with the form:
```
struct ColumnChunkSize {
/** Sum of page_sizes. */
1: required i64 chunk_size
/** Size of page data in bytes. */
2: required list<i64> page_sizes
}
```
If this PR is adopted as-is, then the above structure wouldn't be needed at
all for fixed length data types, just for byte arrays, and the `chunk_size`
field wouldn't be necessary either since it would already be in the column
chunk's `SizeStatistics` struct. Does this seem reasonable? If so I can submit
a draft after this PR is merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]