emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148598484


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I think we end up with if/else statements anyways when we start accounting 
for logical types so I am warming to the idea of only storing bytes for 
variable width types.  I originally thought plain encoding was useful because 
it is already a well understood concept within parquet but it might not add 
value here.
   
     I agree nulls can be a significant portion of space but I don't think it 
should be baked into this field and instead handled as a computation on the 
histograms below.   My rationale for keeping them separate is the space taken 
really depends on memory representation which can differ by reader.  The actual 
nullness some system use 1 bit per null indicator, some use one byte and some 
use 0 because they use sentinel values in data fields.  Similarly some system 
reserve space in data representations for null values and some do not.  Whether 
a system leaves space can also vary depending on at what level a null occurs in 
a nested structure.
   
   I think we should provide guidance that writers should populate all three of 
the newly proposed fields so estimates can be accurate in the presence of nulls.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to