[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705161#comment-17705161
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148748310


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   > We need to look at different levels of metadata or even perform some 
computation to gather the information required above. So my point is to write 
the raw size info for every data type (with logical type considered) and 
store/aggregate them into page and column-chunk levels (or even file level?). 
That would make life easier as the time spent in the planning stage is critical 
to some analytics use cases.
   
   @wgtmac  would the following changes suffice to address your concerns:
   1.  Change the name of the fields to `logical_stored_value_bytes` and define 
the byte count for each logical type. For Decimal, I'd propose using the size 
its underlying physical type (BYTE_ARRAY in this case) would take with 
plain encoding; for consistency, I think this means plain BYTE_ARRAY columns 
should also use the amount of space PLAIN_ENCODING would take.
   2. Extract the three fields into a new struct, something like 
`SizeEstimationStatistics`.
   3. In addition to placing this struct into Statistics (which takes care of 
column-level and page-level stats), also put it onto RowGroup? I'd hesitate to 
put it at the file level because this seems out of character with other 
metadata, and summing across row groups should be lightweight compared to the 
overhead of parsing the FileMetadata anyway.
   4. (Optional) If we were really concerned about optimizations, we could 
convert the histogram to a cumulative distribution function, which would avoid 
summing to get leaf-nulls.
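
To make points 1 and 4 concrete, here is a minimal Python sketch, not a proposed implementation. It assumes the histogram is a list of counts indexed by definition level (this excerpt does not define it), and that "leaf-nulls" means entries below the maximum definition level. All function names are hypothetical. The only format fact relied on is that PLAIN encoding writes a 4-byte length before each BYTE_ARRAY value.

```python
from itertools import accumulate

# PLAIN encoding stores a 4-byte length prefix before each BYTE_ARRAY value.
LENGTH_PREFIX_BYTES = 4


def plain_encoded_byte_array_size(values):
    """Bytes that non-null BYTE_ARRAY values would occupy under PLAIN encoding."""
    return sum(LENGTH_PREFIX_BYTES + len(v) for v in values)


def to_cumulative(histogram):
    """Turn per-level counts into running totals (assumed layout: index = level)."""
    return list(accumulate(histogram))


def below_max_level_from_histogram(histogram):
    """Count entries below the max level by summing every bucket but the last."""
    return sum(histogram[:-1])


def below_max_level_from_cumulative(cumulative):
    """Same count from the cumulative form: a single lookup, no summation."""
    return cumulative[-2] if len(cumulative) > 1 else 0


# Hypothetical data: two decimal values stored as BYTE_ARRAY, and a
# three-level histogram where the last bucket holds the non-null leaves.
decimal_values = [b"\x01", b"\x01\x02\x03"]
level_histogram = [3, 2, 10]

size_bytes = plain_encoded_byte_array_size(decimal_values)
cumulative = to_cumulative(level_histogram)
nulls_summed = below_max_level_from_histogram(level_histogram)
nulls_lookup = below_max_level_from_cumulative(cumulative)
```

The trade-off in the cumulative form is that individual bucket counts then need a subtraction to recover, while the total and the below-max count become direct lookups.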
   
   
   





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
