[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

ASF GitHub Bot (Jira) Sun, 26 Mar 2023 18:43:07 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705153#comment-17705153
 ]


ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148704782


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I agree on putting non-null data and null data into separate fields. The 
space for null values can have a significant impact on memory footprint, so I 
want to use these statistics to derive a good batch size while reading data.
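
   A minimal sketch of how a reader might derive a batch size from such 
statistics. Everything here is an assumption for illustration: the 
`ColumnSizeStats` fields and `estimate_batch_rows` are hypothetical names, not 
part of the proposed Thrift change, and the per-null slot width depends on the 
reader's in-memory layout. It also assumes flat (non-nested) columns, so the 
number of values equals the number of rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnSizeStats:
    """Hypothetical per-column-chunk statistics (names are illustrative)."""
    num_values: int       # total values in the chunk, including nulls
    null_count: int       # number of null values
    non_null_bytes: int   # decoded size of the non-null data
    null_slot_bytes: int  # bytes the reader reserves per null slot (reader-specific)


def estimate_batch_rows(columns: List[ColumnSizeStats], memory_budget_bytes: int) -> int:
    """Estimate how many rows fit in the budget, counting null slots separately."""
    total_bytes = 0
    num_rows = 0
    for c in columns:
        # Non-null data and null slots contribute to memory independently.
        total_bytes += c.non_null_bytes + c.null_count * c.null_slot_bytes
        num_rows = max(num_rows, c.num_values)
    if num_rows == 0 or total_bytes == 0:
        return num_rows
    # Integer arithmetic: rows = budget / (total_bytes / num_rows), rounded down.
    rows = memory_budget_bytes * num_rows // total_bytes
    # Read at least one row, and never more than the row group holds.
    return max(1, min(num_rows, rows))
```

   Keeping the null contribution as a separate term is the point: a reader 
that materializes null slots can account for them, while one that does not can 
set `null_slot_bytes` to 0.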
   
   It also makes sense to store un-encoded bytes only for variable-length types 
(in the Parquet spec, that means only the BYTE_ARRAY type). But that is not easy 
to use in these cases:
   - Get the total raw size of the file (i.e., the size of all columns).
   - Get the total raw size of some selected columns.
   - Get the total raw size of selected columns in some row groups.
   - ...
   
   For these cases, we need to look at different levels of metadata or even 
perform some computation to gather the required information. So my suggestion 
is to write the raw size info for every data type (with logical type 
considered) and store/aggregate it at the page and column-chunk levels (or even 
the file level?). That would make life easier, as the time spent in the 
planning stage is critical to some analytics use cases.
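
   To illustrate why pre-aggregated sizes help planning, here is a rough 
sketch of the three queries listed above, assuming each column chunk already 
carries a raw size. `ColumnChunkMeta`, `RowGroupMeta`, `raw_bytes`, and 
`total_raw_size` are hypothetical names for illustration only; they are not 
fields of parquet.thrift or any reader API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ColumnChunkMeta:
    path: str       # column path, e.g. "a"
    raw_bytes: int  # hypothetical raw (un-encoded) size stored for this chunk


@dataclass
class RowGroupMeta:
    columns: List[ColumnChunkMeta]


def total_raw_size(
    row_groups: List[RowGroupMeta],
    columns: Optional[Iterable[str]] = None,
    row_group_indices: Optional[Iterable[int]] = None,
) -> int:
    """Sum raw sizes over the selected columns and row groups (None = all)."""
    selected = range(len(row_groups)) if row_group_indices is None else row_group_indices
    wanted = None if columns is None else set(columns)
    return sum(
        chunk.raw_bytes
        for i in selected
        for chunk in row_groups[i].columns
        if wanted is None or chunk.path in wanted
    )
```

   With a size available for every type at the column-chunk level, each query 
is a single pass over the footer metadata; if only BYTE_ARRAY columns carried 
it, the planner would have to compute the fixed-width sizes itself from value 
counts and type widths.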





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
