[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

ASF GitHub Bot (Jira) Tue, 12 Sep 2023 06:38:45 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764189#comment-17764189
 ]


ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1323059489


##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1038,25 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * Contains repetition level histograms for each page
+   * concatenated together.  The repetition_level_histogram field on
+   * SizeStatistics contains more details.
+   *
+   * When present the length should always be (number of pages *
+   * (max_repetition_level + 1)) elements in size.
+   *
+   * Element 0 is the first element of the histogram for the first page.
+   * Element (max_repetition_level + 1) is the first element of the histogram
+   * for the second page.

Review Comment:
   I would expect the per-level representation to be slightly superior, as it 
is more useful for filtering. Filtering might lead to most pages being skipped, 
so the overall query time might be very short in this case. The most extreme 
case would be a point look-up where only a single row in a single page survives 
the filters. In this case, the cost of actually performing the filtering on the 
page index might have a measurable impact.
   
   In contrast, for the size-estimation case, we're estimating the size because 
we're planning to read the page. This reading will take orders of magnitude 
longer, so it is not too important to avoid every possible cache-miss in this 
case.
   
   That being said, we're talking about micro-optimizations here. Even though 
my gut feeling is that the other ordering would be superior, I don't mind this 
order. We're not creating hundreds of lists anymore, and that's the most 
important point for performance.
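
To make the two orderings concrete, here is a minimal Python sketch. It assumes the page-major layout described in the doc comment above (each page's histogram stored back to back, max_repetition_level + 1 elements per page) and compares it with what the per-level ordering is understood to mean here (all pages' counts for one level stored contiguously). The function names and the sample numbers are hypothetical and not part of parquet-format or any Parquet library.

```python
def page_histogram(flat, page, max_rep_level):
    """Return the histogram of one page from the page-major flat list."""
    width = max_rep_level + 1          # one bucket per level 0..max
    start = page * width               # element where this page's histogram begins
    return flat[start:start + width]


def level_counts(flat, level, max_rep_level):
    """Return the count for one level across all pages (strided access)."""
    width = max_rep_level + 1
    return flat[level::width]


def to_level_major(flat, num_pages, max_rep_level):
    """Reorder page-major data so each level's counts for all pages are contiguous."""
    width = max_rep_level + 1
    if len(flat) != num_pages * width:
        raise ValueError("expected num_pages * (max_repetition_level + 1) elements")
    return [flat[p * width + lvl] for lvl in range(width) for p in range(num_pages)]


# Made-up example: 3 pages, max_repetition_level = 1
flat = [10, 5, 7, 0, 3, 9]
second_page = page_histogram(flat, 1, 1)   # [7, 0]
level_zero = level_counts(flat, 0, 1)      # [10, 7, 3]
level_major = to_level_major(flat, 3, 1)   # [10, 7, 3, 5, 0, 9]
```

With the page-major order, reading one page's full histogram is a contiguous slice, which suits size estimation, while collecting one level across all pages is a strided read. The per-level order reverses that trade-off, which is the cache-locality point under discussion.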
   





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
