etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319212210


##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1038,25 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * Contains repetition level histograms for each page
+   * concatenated together.  The repetition_level_histogram field on
+   * SizeStatistics contains more details.
+   *
+   * When present the length should always be (number of pages *
+   * (max_repetition_level + 1)) elements in size.
+   *
+   * Element 0 is the first element of the histogram for the first page.
+   * Element (max_repetition_level + 1) is the first element of the histogram
+   * for the second page.

Review Comment:
   > Column-major vs row-major terminology confused me, so I documented one 
approach. If there isn't consensus on the ordering here, let's please create a 
new thread.
   
   I'll take a stab (and likely be wrong :wink:). I believe the discussion has 
assumed the histograms form a matrix with the row index being the page number 
and the column index being the level. Assuming that, what you have defined 
would be the row-major ordering, where elements of the same row (one page's 
histogram) are contiguous in memory, as in a C matrix. What @JFinis and @pitrou 
seem to prefer is the opposite, column-major ordering, where elements of the 
same column (one level's counts across all pages) are contiguous in memory, as 
in a Fortran matrix.
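
   To make the two orderings concrete, here is a minimal sketch in Python. It is 
not taken from the PR; the function names and the sample counts are made up for 
illustration, and it assumes the matrix view described above (rows are pages, 
columns are levels).

```python
# Hypothetical helpers, not part of parquet-format: they only illustrate
# how a (page, level) pair maps to a flat list index under each ordering.

def row_major_index(page, level, num_pages, max_level):
    """Each page's histogram is contiguous (C order)."""
    return page * (max_level + 1) + level


def column_major_index(page, level, num_pages, max_level):
    """Each level's counts across all pages are contiguous (Fortran order)."""
    return level * num_pages + page


# Made-up example: 3 pages, max_repetition_level = 1.
# histograms[page][level] is the count of `level` in `page`.
histograms = [[10, 2], [7, 5], [4, 0]]
num_pages, max_level = 3, 1

# Flatten page by page (row-major).
row_major = [histograms[p][l]
             for p in range(num_pages)
             for l in range(max_level + 1)]

# Flatten level by level (column-major).
col_major = [histograms[p][l]
             for l in range(max_level + 1)
             for p in range(num_pages)]
```

   With the row-major layout, element `(max_repetition_level + 1)` is the first 
element of the second page's histogram, which matches the wording in the diff 
above. With the column-major layout, the same flat index would instead hold the 
level-0 count of a later page.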



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.