[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 06 Sep 2023 12:42:58 -0700


JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317578394



##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> 
repetition_definition_level_histograms

Review Comment:
   The way you encode the values here is wasteful, both in the thrift size but 
even more in the deserialized size, as you store a lot of small lists. Given 
that the column index can become big due to many pages, I would consider this 
inefficiency an issue.
   
   I would propose not grouping repetition & definition level into a struct, 
especially in the column index, so we don't need a struct per page. 
Furthermore, I would actually propose to even "turn the lists around", i.e.:
   
   ```
   optional list<list<i64>> repetition_level_histogram
   
   optional list<list<i64>> definition_level_histogram
   ```
   
   Here, the **outer** list would actually be the list of levels and the 
**inner** list the list of pages. So, if we have two possible levels and 5 
pages, this would store e.g. `[[3,5,2,1,3],[4,5,6,7,8]]`.
   
   This way, the encoding will be way tighter and also more columnar, helping 
with efficient processing.
   
   The reason for this is that most of the time, we will have way more pages 
than levels. E.g., let's say we have 100 pages and 5 repetition and definition 
levels.
   
   With your current proposal, you would need 201 lists (one outer, 100 for 
definition levels, 100 for repetition levels) and 100 structs to represent the 
deserialized form. Especially in languages like Java, each of these objects has 
a considerable overhead. E.g., an ArrayList has 16 bytes overhead and a struct 
has 8, so you need a whopping 4016 bytes just to encode the structs and lists 
without counting the elements in them.
   
   With my proposed encoding, you would only store 12 lists (two outer, 5 inner 
for definition levels, 5 inner for repetition levels) and no structs, having 
only 192 bytes overhead.
   
   In thrift, the overhead is lower, but it is still existent and costs space 
and performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

Reply via email to