This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 31f92c7  PARQUET-2352: Allow truncation of row group min_values/max_value statistics (#216)
31f92c7 is described below

commit 31f92c73ecca63b596b5edb391f9ac5eba9dbbf8
Author: Raunaq Morarka <[email protected]>
AuthorDate: Wed Oct 18 14:53:32 2023 +0530

    PARQUET-2352: Allow truncation of row group min_values/max_value statistics (#216)
    
    This updates the spec to allow truncation of row group min_values/max_value statistics
    so that readers can take advantage of row group pruning for predicates on columns
    containing long strings.
    https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a feature to parquet-mr
    which allows users to deviate from the current spec and configure truncation of row group statistics.
    This change also adds is_max_value_exact/is_min_value_exact to allow writers to specify
    when the max_value/min_value are the actual max and min values found on the column chunk.
---
 src/main/thrift/parquet.thrift | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 5f50f00..9f90572 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -216,13 +216,23 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * Lower and upper bound values for the column, determined by its ColumnOrder.
+    *
+    * These may be the actual minimum and maximum values found on a page or column
+    * chunk, but can also be (more compact) values that do not exist on a page or
+    * column chunk. For example, instead of storing "Blart Versenwald III", a writer
+    * may set min_value="B", max_value="C". Such more compact values must still be
+    * valid values within the column's logical type.
     *
     * Values are encoded using PLAIN encoding, except that variable-length byte
     * arrays do not include a length prefix.
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** If true, max_value is the actual maximum value for a column */
+   7: optional bool is_max_value_exact;
+   /** If true, min_value is the actual minimum value for a column */
+   8: optional bool is_min_value_exact;
 }
 
 /** Empty structs to use as logical type annotations */
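
The new comment on min_value/max_value describes bounds that need not occur in the data. The following is a minimal Python sketch of how a writer might produce such bounds and how a reader might use them for pruning. It is not taken from parquet-mr or any other implementation: the helper names (truncate_min, truncate_max, may_contain, exact_max) and the max_len parameter are hypothetical, and it assumes a binary column compared by unsigned bytewise order. For STRING columns a real writer would also have to keep the truncated bytes valid UTF-8, which this sketch does not attempt.

```python
from typing import Optional, Tuple


def truncate_min(value: bytes, max_len: int) -> Tuple[bytes, bool]:
    """Return (lower_bound, is_min_value_exact) for the actual minimum.

    A prefix of a byte string never sorts after the string itself,
    so plain truncation yields a valid lower bound.
    """
    if len(value) <= max_len:
        return value, True
    return value[:max_len], False


def truncate_max(value: bytes, max_len: int) -> Tuple[bytes, bool]:
    """Return (upper_bound, is_max_value_exact) for the actual maximum.

    The prefix alone would sort before the value, so the last byte that
    is not 0xFF is incremented to obtain something that sorts after it.
    """
    if len(value) <= max_len:
        return value, True
    prefix = bytearray(value[:max_len])
    # Trailing 0xFF bytes cannot be incremented; drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        # No shorter upper bound exists; keep the untruncated value.
        return value, True
    prefix[-1] += 1
    return bytes(prefix), False


def may_contain(target: bytes,
                min_value: Optional[bytes],
                max_value: Optional[bytes]) -> bool:
    """Return False only when the bounds prove target is absent."""
    if min_value is not None and target < min_value:
        return False
    if max_value is not None and target > max_value:
        return False
    return True


def exact_max(max_value: Optional[bytes],
              is_max_value_exact: Optional[bool]) -> Optional[bytes]:
    """Return max_value only if the writer marked it as exact.

    An absent flag (None) is treated as "not known to be exact".
    """
    return max_value if is_max_value_exact else None


lower, lower_exact = truncate_min(b"Blart Versenwald III", 1)
upper, upper_exact = truncate_max(b"Blart Versenwald III", 1)
print(lower, lower_exact, upper, upper_exact)
```

With max_len=1 this reproduces the example from the comment: the bounds become "B" and "C", and both exactness flags are false. Pruning with may_contain stays correct under truncated bounds, while a reader that wants to answer a MIN or MAX query directly from statistics would only do so when the corresponding flag is true.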
