[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

ASF GitHub Bot (Jira) Sun, 26 Mar 2023 21:20:04 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705170#comment-17705170
 ]


ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148766758


##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   > Does this mean the result is different if a certain decimal type uses 
INT32/INT64/FLBA/BA physical type? Should we use the minimal bytes required for 
its precision?
   
   Updated to use minimal bytes.  This actually makes me want to revert to 
recording only variable-length bytes, exclusive of length.  Even though this 
requires some computation on the reader side, I think it is probably the most 
worthwhile option, again because of different memory models and schema 
evolution.  For instance, if the precision of a decimal type is widened in a 
data set, I think that for join planning and memory allocation purposes it is 
most useful to use the reader precision rather than the writer precision.  
Similarly, in common memory models (e.g. Spark's UnsafeRow), integers always 
take 64 bits regardless of the storage type.
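
   As a rough illustration of the reader-side computation described above (a 
sketch only, not part of the PR): the function names and the per-value 
overhead parameter are hypothetical, and the minimal width is assumed to be the 
smallest signed two's-complement size that can hold every unscaled value of the 
given precision.

```python
def min_decimal_bytes(precision: int) -> int:
    """Smallest signed two's-complement width, in bytes, for a decimal precision."""
    if precision < 1:
        raise ValueError("precision must be >= 1")
    max_unscaled = 10 ** precision - 1
    # bit_length() covers the magnitude; one extra bit is needed for the sign.
    bits = max_unscaled.bit_length() + 1
    return (bits + 7) // 8


def estimate_fixed_width_bytes(num_values: int, reader_width_bytes: int) -> int:
    """Fixed-width columns: the reader picks the width, not the writer."""
    return num_values * reader_width_bytes


def estimate_variable_width_bytes(
    num_values: int, variable_length_bytes: int, per_value_overhead_bytes: int
) -> int:
    """Variable-width columns: stored payload bytes (exclusive of length)
    plus whatever per-value overhead the reader's memory model adds."""
    return variable_length_bytes + num_values * per_value_overhead_bytes
```

   Under these assumptions, a column written with precision 9 needs 4 bytes per 
value, while a reader schema widened to precision 20 needs 9, so the estimate 
follows the reader's schema rather than the writer's.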
   
   > Putting into RowGroup is really tricky as we can only sum all columns. It 
is not realistic to permute combination for all columns.
   
   Yep, I came to the same conclusion as I was updating the PR: this is 
impossible and would need other solutions.
   
   > In most cases, the nested depth should not be too deep. I think we can 
make it simple as the current histogram design. IIUC, we can also derive the 
cumulative result easily from the current histogram design.
   
   Reverted back to histogram.
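
   To illustrate the point about deriving cumulative results, here is a minimal 
sketch. It assumes the histogram is a list in which index i holds the number of 
values at level i, and it reads "cumulative" as "count of values at level i or 
above"; both are assumptions, and the function name is hypothetical.

```python
from itertools import accumulate
from typing import List


def cumulative_at_or_above(histogram: List[int]) -> List[int]:
    """Suffix sums: entry i is the number of values whose level is >= i."""
    # Accumulate from the deepest level down, then restore the original order.
    return list(accumulate(reversed(histogram)))[::-1]
```

   For a histogram of [3, 1, 6], this yields [10, 7, 6], so the cumulative 
form does not need to be stored separately.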





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
