etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303299069


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate  
the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> 
repetition_definition_level_histograms; 

Review Comment:
   > Hmm first of all, PageIndex might not a "footer", because it has some 
flexibility for puting it.( each rowgroup has a `(length, offset)` pair for 
column and offset index)
   
   Yes, I'm using "footer" liberally.  I just mean not inline with the page 
data when I say "footer".
   
   > I think the 1-2 are both not perfect suitable here. And as-for user 
defined extended info, we can even encode the user-defined stats in 
key_value_metadata as base64 or base86 string
   
   This is exactly what I was doing in 
https://github.com/rapidsai/cudf/pull/12974. I put  the (location,size) of the 
unencoded size info struct in key/value pairs, and then wrote a per-column 
chunk struct before the column/offset indexes with the form:
   
   ```
   struct ColumnChunkSize {
     /** Sum of page_sizes. */
     1: required i64 chunk_size
     /** Size of page data in bytes. */
     2: required list<i64> page_sizes
   }
   ```
   If this PR is adopted as-is, then the above structure wouldn't be needed at 
all for fixed length data types, just for byte arrays, and the `chunk_size` 
field wouldn't be necessary either since it would already be in the column 
chunk's `SizeStatistics` struct. Does this seem reasonable? If so I can submit 
a draft after this PR is merged.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to