[ 
https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772994#comment-17772994
 ] 

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1349706308


##########
src/main/thrift/parquet.thrift:
##########
@@ -216,13 +216,22 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its 
ColumnOrder.
+    * These may be the actual minimum and maximum values found on a column 
chunk,
+    * but can also be (more compact) values that do not exist on a column 
chunk.
+    * For example, instead of storing "Blart Versenwald III", a writer may set
+    * min_value="B", max_value="C". Such more compact values must still be 
valid
+    * values within the column's logical type.
     *
     * Values are encoded using PLAIN encoding, except that variable-length byte
     * arrays do not include a length prefix.
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** If true, max_value is the actual maximum value found on a column chunk 
**/
+   7: optional bool is_max_value_exact;
+   /** If true, min_value is the actual minimum value found on a column chunk 
**/

Review Comment:
   ```suggestion
      /** If true, min_value is the actual minimum value for a column */
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -216,13 +216,22 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its 
ColumnOrder.
+    * These may be the actual minimum and maximum values found on a column 
chunk,
+    * but can also be (more compact) values that do not exist on a column 
chunk.
+    * For example, instead of storing "Blart Versenwald III", a writer may set
+    * min_value="B", max_value="C". Such more compact values must still be 
valid
+    * values within the column's logical type.

Review Comment:
   ```suggestion
       * Lower and upper bound values for the column, determined by its 
ColumnOrder.
       *
       * These may be the actual minimum and maximum values found on a page or 
column
       * chunk, but can also be (more compact) values that do not exist on a 
page or
       * column chunk. For example, instead of storing "Blart Versenwald III", 
a writer
       * may set min_value="B", max_value="C". Such more compact values must 
still be
       * valid values within the column's logical type.
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -216,13 +216,22 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its 
ColumnOrder.
+    * These may be the actual minimum and maximum values found on a column 
chunk,
+    * but can also be (more compact) values that do not exist on a column 
chunk.
+    * For example, instead of storing "Blart Versenwald III", a writer may set
+    * min_value="B", max_value="C". Such more compact values must still be 
valid
+    * values within the column's logical type.
     *
     * Values are encoded using PLAIN encoding, except that variable-length byte
     * arrays do not include a length prefix.
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** If true, max_value is the actual maximum value found on a column chunk 
**/

Review Comment:
   ```suggestion
      /** If true, max_value is the actual maximum value for a column */
   ```





> Update parquet format spec to allow truncation of row group min/max stats
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2352
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2352
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Raunaq Morarka
>            Priority: Major
>
> Column index stats are explicitly allowed to be truncated 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958]
> However, it seems row group min/max stats are not allowed to be truncated 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219]
>  although it is not explicitly clarified like in the column index case. This 
> forces implementations to either drop min/max row group stats for columns 
> with long strings and miss opportunities for filtering row groups or 
> seemingly deviate from spec by truncating min/max row group stats.
> https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature to 
> parquet-mr which allows users to deviate from spec and configure truncation 
> of min/max row group stats. Unfortunately, there is no way for readers to 
> detect whether truncation took place.
> Since the possibility of truncation exists and is not possible to explicitly 
> detect, attempts to pushdown min/max aggregation to parquet have avoided 
> implementing it for string columns (e.g. 
> https://issues.apache.org/jira/browse/SPARK-36645)
> Given the above situation, the spec should be updated to allow truncation of 
> min/max row group stats. This would align the spec with current reality that 
> string column min/max row group stats could be truncated.
> Additionally, a flag could be added to the stats to specify whether min/max 
> stats are truncated. Reader implementations could then safely implement 
> min/max aggregation pushdown to strings for new data going forward by 
> checking the value of this flag. When the flag is not found on existing data 
> then it could be assumed that the data could be truncated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to