wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148580752


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I'd rather do this for all primitive types. Otherwise we may end up with 
different if/else approaches for different types which complicates the logic.



##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   However, I hold a different opinion that null should be considered here. 
   
   For example, if most of values are null, we may under-estimate the size and 
make the split planning much more imprecise.
   
   Although we have `null_count` somewhere (e.g. Statistics or ColumnIndex) but 
they are optional and may be missing even if `plain_encoded_bytes` here is 
provided. If we can enforce `null_count` with `plain_encoded_bytes` to appear 
together, then this is not an issue.
   
   I am not sure if we can slightly modify the definition of 
`plain_encoded_bytes` to mean the total bytes as if the data is all non-null 
and plain-encoded. For numeric types, a null value consumes sizeof(T) bytes. 
For FLBA type, we can simply assume a null value still consumes its original 
fixed-length. For BYTE_ARRAY type, a null value consume empty length.
   
   That said, I am still open to follow the proposed approach from @emkornfield 
because it is simple to understand.



##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+     * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+     * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+     */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   These levels are rather complicated for high-level applications, we may need 
to provide some utility to translate to norms (e.g. struct/map/array/null) that 
are much easier to understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to