[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758117#comment-17758117 ]

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303197041


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts; however, to accommodate the
+    * widest range of readers, both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
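
As an aside on the redundancy the comment mentions: for a flat optional column, the per-page null count can be derived from the definition-level histogram, which is why the two fields overlap. A minimal illustrative sketch (names are hypothetical, not parquet-format code):

```python
# Illustrative sketch, not parquet-format code. For a flat optional column
# (max definition level 1), entries below the max definition level are the
# nulls, so null_counts is recoverable from the histogram. For nested
# schemas, lower levels can also mean null or empty ancestors.

def null_count_from_def_histogram(def_level_histogram, max_definition_level):
    """Sum the counts of all definition levels below the maximum."""
    return sum(def_level_histogram[:max_definition_level])

page_def_histogram = [3, 97]  # 3 entries at level 0 (null), 97 at level 1
assert null_count_from_def_histogram(page_def_histogram, 1) == 3
```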

Review Comment:
   > If you are not reading a parquet file in the streaming fashion, why is SizeStatistics at the column-chunk level not enough? The pages of different columns are not aligned, and you will somehow end up reading the entire column chunk.
   
   @wgtmac just because the pages aren't aligned doesn't mean I have to read them all :wink: In a large row group with small pages, the non-alignment can be minimized, and there can still be a win from not reading unnecessary pages.
   
   As to why the column-chunk level sizing info isn't enough: I have files where the un-encoded size of the file is over 40X larger than the on-disk size, due primarily to vast savings from dictionary encoding. So a 1GB row group could potentially blow up to 40GB when fully decoded. In the constrained environment of a GPU that's not tenable. Being able to know in advance which pages I can read and decode while still keeping everything on the GPU is very beneficial. To get this sizing information now, we have to read and decompress every page, doing most of the work of decoding the file just to find the total size of all the byte arrays. I'd prefer not to have to make 2 passes through the file :smile:
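
   The access pattern described above can be sketched as follows. This assumes per-page un-encoded sizes are available from the metadata; the function and variable names are illustrative, not the parquet-format API:

```python
# Hypothetical sketch: with per-page un-encoded sizes in the metadata, a
# reader can choose the longest run of pages whose decoded size fits a
# fixed memory budget, instead of decompressing every page just to learn
# the decoded byte-array sizes.

def pages_within_budget(unencoded_byte_array_sizes, budget_bytes):
    """Return the indices of a prefix of pages whose total decoded size
    stays within budget_bytes, plus the total size selected."""
    selected, total = [], 0
    for page_index, size in enumerate(unencoded_byte_array_sizes):
        if total + size > budget_bytes:
            break
        total += size
        selected.append(page_index)
    return selected, total

# E.g. pages from a chunk that decodes ~40x larger than on disk, read in
# slices that fit a 1 GB on-GPU budget:
sizes = [300_000_000, 500_000_000, 400_000_000, 200_000_000]
pages, used = pages_within_budget(sizes, 1_000_000_000)
assert pages == [0, 1] and used == 800_000_000
```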





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
