[ 
https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767364#comment-17767364
 ] 

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

raunaqmorarka opened a new pull request, #216:
URL: https://github.com/apache/parquet-format/pull/216

   ### Jira
   
     - https://issues.apache.org/jira/browse/PARQUET-2352
   
   This updates the spec to allow truncation of row group min_value/max_value 
statistics so that readers can take advantage of row group pruning for 
predicates on columns containing long strings.
   https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a 
feature in parquet-mr that lets users deviate from the current spec and 
configure truncation of row group statistics.
   
   Since truncation may have taken place and readers have no way to detect 
it, attempts to push down min/max aggregation to Parquet have avoided 
implementing it for string columns (e.g. 
https://issues.apache.org/jira/browse/SPARK-36645).
   Given this situation, the spec should be updated to allow truncation of 
min/max row group stats. This would align the spec with the current reality 
that string column min/max row group stats could be truncated.
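
To illustrate why truncated stats remain usable for row group pruning, here is a minimal Python sketch of one way a writer could shorten the bounds. It is not the parquet-mr implementation; the function names and the `length` parameter are hypothetical, and it assumes values are compared as unsigned bytes. The shortened min must still sort at or below every value in the row group, and the shortened max at or above.

```python
from typing import Optional


def truncate_min(value: bytes, length: int) -> bytes:
    """Return a lower bound no longer than `length` bytes.

    Any prefix of a byte string sorts at or below the full string,
    so a plain prefix is a valid (possibly inexact) minimum.
    """
    return value[:length]


def truncate_max(value: bytes, length: int) -> Optional[bytes]:
    """Return an upper bound no longer than `length` bytes, or None.

    A plain prefix would sort below the original value, so the last
    byte that is not 0xFF is incremented after dropping trailing 0xFF bytes.
    None means no shorter upper bound exists (the prefix is all 0xFF).
    """
    if len(value) <= length:
        return value  # nothing to truncate, bound stays exact
    prefix = bytearray(value[:length])
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()  # 0xFF cannot be incremented, shorten further
    if not prefix:
        return None
    prefix[-1] += 1
    return bytes(prefix)


def may_contain(stat_min: bytes, stat_max: bytes, target: bytes) -> bool:
    """Pruning check for an equality predicate; safe with truncated bounds."""
    return stat_min <= target <= stat_max
```

With these bounds a reader can still skip a row group whenever the predicate value falls outside `[min, max]`, but it can no longer report `min` or `max` as actual column values. For UTF-8 strings, incrementing a raw byte may yield an invalid sequence, so a real writer would need extra care there.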
   




> Update parquet format spec to allow truncation of row group min/max stats
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2352
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2352
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Raunaq Morarka
>            Priority: Major
>
> Column index stats are explicitly allowed to be truncated 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958]
> However, it seems row group min/max stats are not allowed to be truncated 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219]
>  although it is not explicitly clarified like in the column index case. This 
> forces implementations to either drop min/max row group stats for columns 
> with long strings and miss opportunities for filtering row groups or 
> seemingly deviate from spec by truncating min/max row group stats.
> https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature to 
> parquet-mr which allows users to deviate from spec and configure truncation 
> of min/max row group stats. Unfortunately, there is no way for readers to 
> detect whether truncation took place.
> Since truncation may have taken place and readers have no way to detect 
> it, attempts to push down min/max aggregation to Parquet have avoided 
> implementing it for string columns (e.g. 
> https://issues.apache.org/jira/browse/SPARK-36645).
> Given the above situation, the spec should be updated to allow truncation of 
> min/max row group stats. This would align the spec with current reality that 
> string column min/max row group stats could be truncated.
> Additionally, a flag could be added to the stats to specify whether min/max 
> stats are truncated. Reader implementations could then safely implement 
> min/max aggregation pushdown for strings on new data going forward by 
> checking the value of this flag. When the flag is absent, as on existing 
> data, readers would have to assume that the stats may be truncated.
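
The flag-based reader behaviour proposed above can be sketched as follows. This is only an illustration: the `RowGroupStats` class and the `min_is_exact` / `max_is_exact` field names are hypothetical and not taken from the spec. A missing flag is modelled as `None` and treated as "possibly truncated".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RowGroupStats:
    min_value: bytes
    max_value: bytes
    # Hypothetical flags; None models existing files written without them.
    min_is_exact: Optional[bool] = None
    max_is_exact: Optional[bool] = None


def pushed_down_min(groups: Sequence[RowGroupStats]) -> Optional[bytes]:
    """Answer MIN(col) from stats alone, or return None to force a data scan.

    Pushdown is only safe when every row group explicitly marks its
    minimum as exact; an absent flag is treated like a truncated value.
    """
    if not groups or any(g.min_is_exact is not True for g in groups):
        return None
    return min(g.min_value for g in groups)


def pushed_down_max(groups: Sequence[RowGroupStats]) -> Optional[bytes]:
    """Same rule as pushed_down_min, applied to the maximum."""
    if not groups or any(g.max_is_exact is not True for g in groups):
        return None
    return max(g.max_value for g in groups)
```

A reader following this rule falls back to scanning the column for any file that lacks the flags, which preserves correctness on data written before such a flag existed.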



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
