[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772994#comment-17772994 ]
ASF GitHub Bot commented on PARQUET-2352: ----------------------------------------- wgtmac commented on code in PR #216: URL: https://github.com/apache/parquet-format/pull/216#discussion_r1349706308 ########## src/main/thrift/parquet.thrift: ########## @@ -216,13 +216,22 @@ struct Statistics { /** count of distinct values occurring */ 4: optional i64 distinct_count; /** - * Min and max values for the column, determined by its ColumnOrder. + * lower and upper bound values for the column, determined by its ColumnOrder. + * These may be the actual minimum and maximum values found on a column chunk, + * but can also be (more compact) values that do not exist on a column chunk. + * For example, instead of storing "Blart Versenwald III", a writer may set + * min_value="B", max_value="C". Such more compact values must still be valid + * values within the column's logical type. * * Values are encoded using PLAIN encoding, except that variable-length byte * arrays do not include a length prefix. */ 5: optional binary max_value; 6: optional binary min_value; + /** If true, max_value is the actual maximum value found on a column chunk **/ + 7: optional bool is_max_value_exact; + /** If true, min_value is the actual minimum value found on a column chunk **/ Review Comment: ```suggestion /** If true, min_value is the actual minimum value for a column */ ``` ########## src/main/thrift/parquet.thrift: ########## @@ -216,13 +216,22 @@ struct Statistics { /** count of distinct values occurring */ 4: optional i64 distinct_count; /** - * Min and max values for the column, determined by its ColumnOrder. + * lower and upper bound values for the column, determined by its ColumnOrder. + * These may be the actual minimum and maximum values found on a column chunk, + * but can also be (more compact) values that do not exist on a column chunk. + * For example, instead of storing "Blart Versenwald III", a writer may set + * min_value="B", max_value="C". Such more compact values must still be valid + * values within the column's logical type. Review Comment: ```suggestion * Lower and upper bound values for the column, determined by its ColumnOrder. * * These may be the actual minimum and maximum values found on a page or column * chunk, but can also be (more compact) values that do not exist on a page or * column chunk. For example, instead of storing "Blart Versenwald III", a writer * may set min_value="B", max_value="C". Such more compact values must still be * valid values within the column's logical type. ``` ########## src/main/thrift/parquet.thrift: ########## @@ -216,13 +216,22 @@ struct Statistics { /** count of distinct values occurring */ 4: optional i64 distinct_count; /** - * Min and max values for the column, determined by its ColumnOrder. + * lower and upper bound values for the column, determined by its ColumnOrder. + * These may be the actual minimum and maximum values found on a column chunk, + * but can also be (more compact) values that do not exist on a column chunk. + * For example, instead of storing "Blart Versenwald III", a writer may set + * min_value="B", max_value="C". Such more compact values must still be valid + * values within the column's logical type. * * Values are encoded using PLAIN encoding, except that variable-length byte * arrays do not include a length prefix. */ 5: optional binary max_value; 6: optional binary min_value; + /** If true, max_value is the actual maximum value found on a column chunk **/ Review Comment: ```suggestion /** If true, max_value is the actual maximum value for a column */ ``` > Update parquet format spec to allow truncation of row group min/max stats > ------------------------------------------------------------------------- > > Key: PARQUET-2352 > URL: https://issues.apache.org/jira/browse/PARQUET-2352 > Project: Parquet > Issue Type: Improvement > Reporter: Raunaq Morarka > Priority: Major > > Column index stats are explicitly allowed to be truncated > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958] > However, it seems row group min/max stats are not allowed to be truncated > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219] > although it is not explicitly clarified like in the column index case. This > forces implementations to either drop min/max row group stats for columns > with long strings and miss opportunities for filtering row groups or > seemingly deviate from spec by truncating min/max row group stats. > https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature to > parquet-mr which allows users to deviate from spec and configure truncation > of min/max row group stats. Unfortunately, there is no way for readers to > detect whether truncation took place. > Since the possibility of truncation exists and is not possible to explicitly > detect, attempts to pushdown min/max aggregation to parquet have avoided > implementing it for string columns (e.g. > https://issues.apache.org/jira/browse/SPARK-36645) > Given the above situation, the spec should be updated to allow truncation of > min/max row group stats. This would align the spec with current reality that > string column min/max row group stats could be truncated. > Additionally, a flag could be added to the stats to specify whether min/max > stats are truncated. Reader implementations could then safely implement > min/max aggregation pushdown to strings for new data going forward by > checking the value of this flag. When the flag is not found on existing data > then it could be assumed that the data could be truncated. -- This message was sent by Atlassian Jira (v8.20.10#820010)