[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705104#comment-17705104
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148598484


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I think we end up with if/else statements anyways when we start accounting 
for logical types so I am warming to the idea of only storing bytes for 
variable width types.  I originally thought plain encoding was useful because 
it is already a well understood concept within parquet but it might not add 
value here.
   
     I agree nulls can be a significant portion of space but I don't think it 
should be baked into this field and instead handled as a computation on the 
histograms below.   My rationale for keeping them separate is the space taken 
really depends on memory representation which can differ by reader.  The actual 
nullness some system use 1 bit per null indicator, some use one byte and some 
use 0 because they use sentinel values in data fields.  Similarly some system 
reserve space in data representations for null values and some do not.  Whether 
a system leaves space can also vary depending on at what level a null occurs in 
a nested structure.
   
   I think we should provide guidance that writers should populate all three of 
the newly proposed fields so estimates can be accurate in the presence of nulls.





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to