This is an automated email from the ASF dual-hosted git repository.
gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 31f92c7 PARQUET-2352: Allow truncation of row group
min_values/max_value statistics (#216)
31f92c7 is described below
commit 31f92c73ecca63b596b5edb391f9ac5eba9dbbf8
Author: Raunaq Morarka <[email protected]>
AuthorDate: Wed Oct 18 14:53:32 2023 +0530
PARQUET-2352: Allow truncation of row group min_values/max_value statistics
(#216)
This updates the spec to allow truncation of row group min_value/max_value
statistics, so that readers can take advantage of row group pruning for
predicates on columns containing long strings.
https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a
feature in parquet-mr that allows users to deviate from the current spec
and configure truncation of row group statistics.
This change also adds is_max_value_exact/is_min_value_exact to allow
writers to specify when the max_value/min_value are the actual maximum and
minimum values found in the column chunk.
---
src/main/thrift/parquet.thrift | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 5f50f00..9f90572 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -216,13 +216,23 @@ struct Statistics {
/** count of distinct values occurring */
4: optional i64 distinct_count;
/**
- * Min and max values for the column, determined by its ColumnOrder.
+ * Lower and upper bound values for the column, determined by its ColumnOrder.
+ *
+ * These may be the actual minimum and maximum values found on a page or column
+ * chunk, but can also be (more compact) values that do not exist on a page or
+ * column chunk. For example, instead of storing "Blart Versenwald III", a writer
+ * may set min_value="B", max_value="C". Such more compact values must still be
+ * valid values within the column's logical type.
*
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*/
5: optional binary max_value;
6: optional binary min_value;
+ /** If true, max_value is the actual maximum value for a column */
+ 7: optional bool is_max_value_exact;
+ /** If true, min_value is the actual minimum value for a column */
+ 8: optional bool is_min_value_exact;
}
/** Empty structs to use as logical type annotations */
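As an illustration of the change above, here is a minimal Python sketch of how a writer might produce truncated bounds and how a reader might use them. It is not taken from parquet-mr or any other implementation. The names Bound, truncate_min, truncate_max, may_contain_equal and max_from_stats are hypothetical, and the sketch assumes a plain binary column compared in unsigned byte-wise order. A UTF-8 string column would need truncation that respects code point boundaries so that the result remains a valid value of the logical type. An unset is_*_exact field is modeled here as False, on the assumption that a reader cannot treat an unmarked bound as exact.

```python
from typing import NamedTuple, Optional


class Bound(NamedTuple):
    """Hypothetical pairing of a statistics value with its is_*_exact flag."""
    value: bytes
    is_exact: bool


def truncate_min(value: bytes, max_len: int) -> Bound:
    """Lower bound: a prefix never sorts after the full value (byte-wise)."""
    if len(value) <= max_len:
        return Bound(value, True)
    return Bound(value[:max_len], False)


def truncate_max(value: bytes, max_len: int) -> Bound:
    """Upper bound: cut to max_len, then increment the last incrementable byte."""
    if len(value) <= max_len:
        return Bound(value, True)
    prefix = bytearray(value[:max_len])
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        # No shorter upper bound exists; fall back to the exact value.
        return Bound(value, True)
    prefix[-1] += 1
    return Bound(bytes(prefix), False)


def may_contain_equal(target: bytes, lo: Bound, hi: Bound) -> bool:
    """Pruning check for `col = target`; valid for exact and inexact bounds."""
    return lo.value <= target <= hi.value


def max_from_stats(hi: Bound) -> Optional[bytes]:
    """Answer MAX(col) from statistics only when the bound is marked exact."""
    return hi.value if hi.is_exact else None


lo = truncate_min(b"Blart Versenwald III", 1)
hi = truncate_max(b"Blart Versenwald III", 1)
print(lo, hi)
```

With a length limit of 1 this reproduces the example from the Thrift comment: the lower bound becomes "B" and the upper bound becomes "C", both marked as not exact. Pruning on a predicate only needs the bounds to be valid, whereas returning a minimum or maximum directly from the statistics is only safe when the corresponding flag is true.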