[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705166#comment-17705166
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148758018


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   > for Decimal, I'd propose using the underlying size of what it would take 
to use plain-encoding
   
   Does this mean the result differs depending on whether a given decimal type 
uses the INT32, INT64, FLBA, or BA physical type? Should we instead use the 
minimal number of bytes required for its precision?
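   
   To make the question concrete, here is a rough sketch (not part of the PR) 
of how the per-value plain-encoded size could vary with the physical type. The 
function names are hypothetical, and it assumes FLBA/BA writers use the minimal 
two's-complement width for the precision, which a real writer may not do:

```python
def min_bytes_for_precision(precision: int) -> int:
    """Smallest two's-complement width, in bytes, that can hold any
    unscaled decimal value with the given number of digits."""
    if precision < 1:
        raise ValueError("precision must be >= 1")
    max_unscaled = 10 ** precision - 1
    n = 1
    # n bytes hold signed values up to 2**(8n - 1) - 1.
    while (1 << (8 * n - 1)) - 1 < max_unscaled:
        n += 1
    return n


def plain_bytes_per_value(physical_type: str, precision: int) -> int:
    """Plain-encoded bytes per non-null value (illustrative only)."""
    width = min_bytes_for_precision(precision)
    if physical_type == "INT32":
        return 4
    if physical_type == "INT64":
        return 8
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        return width  # assumption: type_length equals the minimal width
    if physical_type == "BYTE_ARRAY":
        return 4 + width  # 4-byte length prefix plus minimal-width payload
    raise ValueError(f"unsupported physical type: {physical_type}")
```

   Under these assumptions a decimal with precision 5 would count as 4, 8, 3, 
or 7 bytes per value for INT32, INT64, FLBA, and BA respectively, even though 
the logical data is identical.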
   
   > Extract the three fields into a new struct something 
like:SizeEstimationStatistics.
   
   This sounds good. Otherwise we would need to check for the existence of each 
separate field added by this patch.
   
   > In addition to placing this struct into Statistics (which takes care of 
column level and page level) stats, also put it onto RowGroup? I'd hesitate to 
put it at the file level because this seems out of character with other 
metadata) and summing across row groups should be lightweight compared to the 
overhead of parsing the FileMetadata anyways?
   
   I think putting it into Statistics is enough. From there it is cheap to 
access the values and sum them across whichever columns of the RowGroup are 
needed. Putting it into RowGroup is tricky, as we could only store the sum over 
all columns. It is not realistic to precompute a value for every combination of 
columns.
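   
   As a rough illustration of the point (the dict layout and function names are 
hypothetical, not an existing reader API), summing per-column values for a 
projection is a one-liner, while the number of possible projections grows 
exponentially with the column count:

```python
def projected_plain_bytes(per_column_bytes: dict, columns: list) -> int:
    """Sum the per-column plain_encoded_bytes for the selected columns.

    `per_column_bytes` is a hypothetical mapping from column name to the
    value read from that column chunk's Statistics in one row group.
    """
    return sum(per_column_bytes[name] for name in columns)


def column_combinations(num_columns: int) -> int:
    """Number of non-empty column subsets a RowGroup-level field would
    have to cover to answer every possible projection."""
    return 2 ** num_columns - 1
```

   For example, `projected_plain_bytes({"a": 100, "b": 250, "c": 40}, ["a", 
"c"])` sums just the two requested columns, whereas a row group with only 20 
columns already has over a million distinct projections.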
   
   > If we were really concerned about optimizations we could convert the 
histogram to cumulative distribution function, which would avoid summing to get 
leaf-nulls.
   
   In most cases, the nesting depth should not be too deep. I think we can keep 
it simple, as in the current histogram design. IIUC, we can also easily derive 
the cumulative result from the current histogram.
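   
   A minimal sketch of that derivation, assuming the histogram is a list of 
counts indexed by definition level with the last entry being the maximum level 
(this layout is my reading of the proposal, and the function names are 
hypothetical):

```python
from itertools import accumulate


def cumulative(histogram: list) -> list:
    """Running totals: entry i counts values with definition level <= i."""
    return list(accumulate(histogram))


def leaf_nulls(histogram: list) -> int:
    """Values whose definition level is below the maximum, i.e. the leaf
    is not present. Requires summing all but the last bucket."""
    return sum(histogram[:-1])
```

   With a histogram such as `[2, 3, 10]`, the cumulative form stores the 
leaf-null count directly in its second-to-last entry, but the sum over a 
handful of buckets is cheap enough either way.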





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
