[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705106#comment-17705106 ]

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148600428


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /**
+     * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_leve)
+     * where each element represens the count of the number of times that level occurs in the page/column chunk.
+     */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   I agree there is complexity here, but I think this is the simplest and most 
complete set of information that we can provide to readers.  The utility 
methods would likely be per-system, but we could likely provide some in core 
packages that give leaf value estimates that follow the rules above.
   
   @mapleFU in terms of usage, these would be used much like the levels 
themselves are used for row-level reconstruction.  For instance, assuming no 
repeated fields, the total number of nulls at the leaf can be computed as the 
cumulative sum of the first n-1 entries in the definition level histogram.  
Similarly, assuming no null lists, the number of nested lists should be 
computable with a similar cumulative sum.  When lists are nullable, the 
repetition and definition levels need to be looked at together to distinguish 
null lists from empty lists.  Likewise, the number of empty lists at a given 
level can be inferred by looking at the definition and repetition levels 
together.
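
   As a rough illustration of the null-count rule above, here is a minimal 
sketch.  It assumes a definition level histogram laid out like the proposed 
repetition one (one count per level, indexed from 0 up to the max definition 
level) and a column with no repeated ancestors.  The function names and the 
histogram layout are assumptions for illustration, not part of the PR.

```python
from typing import Sequence


def leaf_null_count(definition_level_histogram: Sequence[int]) -> int:
    """Estimate null leaf slots for a column with no repeated fields.

    Assumption: entry i holds the number of times definition level i occurs,
    and the last entry corresponds to the max definition level, i.e. a
    fully defined (non-null) leaf value.
    """
    if not definition_level_histogram:
        raise ValueError("histogram must contain at least one entry")
    # Every level below the max means some ancestor or the leaf is null,
    # so the null count is the sum of the first n-1 entries.
    return sum(definition_level_histogram[:-1])


def leaf_value_count(definition_level_histogram: Sequence[int]) -> int:
    """Number of non-null leaf values: the count at the max definition level."""
    if not definition_level_histogram:
        raise ValueError("histogram must contain at least one entry")
    return definition_level_histogram[-1]


# Hypothetical histogram for an optional int nested in an optional struct
# (max definition level 2): 3 null structs, 5 null ints, 92 defined values.
example_histogram = [3, 5, 92]
null_slots = leaf_null_count(example_histogram)
defined_values = leaf_value_count(example_histogram)
```

   The nullable-list cases are deliberately left out of this sketch, since 
they need the repetition and definition levels considered together.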





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
