Of course min & max can also be non-exact on ints, doubles, etc. But for those types, being inexact doesn't mean being truncated. It just means being "wider" than the true bounds. E.g., if the true max is 5 and the max in the statistics is 8, then it is non-exact, so the notion of not being exact also makes sense for fixed length types as well.
The question is whether it would ever make sense to store a non-tight bound for fixed-length types. For formats that aggregate across multiple files or row groups, the clear answer is "yes"! This can happen when files / row groups were deleted. E.g., say the max between a couple of Parquet files is 8. Now one Parquet file is deleted from this set of files (say in an Iceberg or DeltaLake table). Without checking all other files, we can't know what the new max is, but we know it is less or equal to 8. So we could update the statistics, keeping 8 in there but marking it as not exact, as it indeed might not be exact. But now, what does this mean in context of Parquet. Parquet, as of now, doesn't have statistics spanning multiple row groups, so the example I made above doesn't appear in Parquet. You could make the same case for pages inside a row group (i.e., if one page gets "deleted", then you could update the column chunk statistics, setting the min & max as not exact), but selective deletion of pages is a weird use case, so probably nothing that needs to be considered. Another case why you might want to have non-tight bounds is if you don't want to compute them but derive them from existing statistics that were already labeled as non-exact. Say you "compact" a whole Iceberg into a single Parquet row group and you don't want to compute the min/max but instead want to take the ones from the Iceberg and these were labeled as non-exact. This use case is pretty dubious I admit. Computing a min & max when you're touching the data anyway is cheap enough to just do it. So in conclusion, I don't see a compelling use case why someone would want non-exact bounds on fixed size types in Parquet. But semantically speaking, they do make sense also for fixed size types. Cheers, Jan Am Do., 5. Sept. 2024 um 09:19 Uhr schrieb Gábor Szádovszky < ga...@apache.org>: > Hi Xuwei, > > There is no "exact" flag from page index because the values there are "not > exact" by design. See "observations" at [1]. > > 1. I think we should be more precise in the spec. It would not make sense > to truncate 32 or 64 bit values and it won't be compatible with existing > implementations either. So, I would say, the "exact" flags would only be > meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types. For any other > types the flags should not be used and would mean "exact". > 2. Since we already have releases that may produce truncation (see [2]) > without having the related flags in the format, we shall not handle min/max > values as exact without the flags. If the flags are not present, we shall > handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially truncated. > > Cheers, > Gabor > > [1] > > https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach > [2] https://issues.apache.org/jira/browse/PARQUET-1685 > > wish maple <maplewish...@gmail.com> ezt írta (időpont: 2024. szept. 5., > Cs, > 5:18): > > > Currently, Parquet-spec[1] and implementations in parquet-java[2], > > parquet-rs[3] allows truncation in Parquet statistics. The statistics > > truncation might happens in ColumnChunk level statistics, page level > > statistics and Page Index. > > > > Currently, the truncate in [2][3] follows the underlying rule: > > > > In arrow-rs[2]: > > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be truncated > > 2. The truncated utf-8 should also be utf-8. > > > > In parquet-java [3][4]. The writer would maintains a "truncate-length", > and > > String type would > > be truncate to this length. > > > > Currently, in public parquet-format spec[1], we have > `is_{min|max}_exact`, > > but it's only in > > `Statistics`, and not in PageIndex. > > > > So, when consuming a Statistics: > > 1. Can Int32/Int64/Float be statistics decided "exact" if it exists, even > > if Statistics.{min|max}_exact is not set? > > 2. Should string/flba statistics regarded as "in-exact" if > > Statistics.{min|max}_exact is not set? > > > > Best, > > Xuwei Fu > > > > [1] > > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288 > > [2] > > > > > https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837 > > [3] > > > > > https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182 > > [4] > > > > > https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116 > > >