[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705906#comment-17705906
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1150299027


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /**
+     * When present there is expected to be one element corresponding to each repetition level (i.e. size=max repetition_level)
+     * where each element represents the count of the number of times that level occurs in the page/column chunk.
+     */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   Here are some simple examples from Arrow (my rep/def level calculations are 
rusty, so please double-check that they make sense):
   1.  If all you have are nested structures (say 3 nullable levels), then the 
number of null values at the leaf is the sum of the first 3 elements of the 
definition level histogram.  Per-level nullability isn't very important for 
Arrow, because Arrow always reserves space for nulls, so the layout is pretty 
much square.  You could, however, determine that there are no nulls at the 
second level of nesting by checking the appropriate histogram bucket (and 
potentially skip allocating the bit vector in that case).
   2.  If you have two nested lists, and assuming lists and elements are not 
nullable, the number of inner nested lists is `rep_hist[0] - def_hist[0] + 
rep_hist[1]` (outer list starts - empty outer lists + inner list starts).
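
   Both calculations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`level_histogram`, `leaf_null_count`, `inner_list_count`) are hypothetical and not part of any Parquet or Arrow API, and the histograms are assumed to be indexed by level value with `max_level + 1` entries. The sample levels were worked out by hand for the rows `[[1, 2], [3]]`, `[]`, `[[], [4]]`.

```python
from collections import Counter


def level_histogram(levels, max_level):
    """Count how often each level value 0..max_level occurs (hypothetical helper)."""
    counts = Counter(levels)
    return [counts.get(level, 0) for level in range(max_level + 1)]


def leaf_null_count(def_hist):
    """Example 1: nested nullable structs.

    A leaf is non-null only at the maximum definition level, so the null
    count is the sum of every bucket except the last one.
    """
    return sum(def_hist[:-1])


def inner_list_count(rep_hist, def_hist):
    """Example 2: two nested lists, with lists and elements not nullable.

    outer list starts - empty outer lists + inner lists started inside an
    existing outer list.
    """
    return rep_hist[0] - def_hist[0] + rep_hist[1]


# Example 1: three nullable struct levels, so the max definition level is 3.
struct_def_levels = [3, 3, 0, 1, 2, 3, 2]
struct_def_hist = level_histogram(struct_def_levels, max_level=3)
struct_nulls = leaf_null_count(struct_def_hist)

# Example 2: rows [[1, 2], [3]], [], [[], [4]] (max rep level 2, max def level 2).
list_rep_levels = [0, 2, 1, 0, 0, 1]
list_def_levels = [2, 2, 2, 0, 1, 2]
list_rep_hist = level_histogram(list_rep_levels, max_level=2)
list_def_hist = level_histogram(list_def_levels, max_level=2)
inner_lists = inner_list_count(list_rep_hist, list_def_hist)

print(struct_def_hist, struct_nulls)
print(list_rep_hist, list_def_hist, inner_lists)
```

   For the sample rows, the inner lists are `[1, 2]`, `[3]`, `[]` and `[4]`, which matches the count the formula produces.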





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
