Thanks, Jan for the correction. I was stuck with the idea of truncation which naturally applies to the array types. As you explained, non-exact min/max values might make sense for any types (except boolean).
Cheers, Gabor Jan Finis <jpfi...@gmail.com> ezt írta (időpont: 2024. szept. 5., Cs, 16:20): > Of course min & max can also be non-exact on ints, doubles, etc. But for > those types, being inexact doesn't mean being truncated. It just means > being "wider" than the true bounds. E.g., if the true max is 5 and the max > in the statistics is 8, then it is non-exact, so the notion of not being > exact also makes sense for fixed length types as well. > > The question is whether it would ever make sense to store a non-tight bound > for fixed-length types. > > For formats that aggregate across multiple files or row groups, the clear > answer is "yes"! This can happen when files / row groups were deleted. > E.g., say the max between a couple of Parquet files is 8. Now one Parquet > file is deleted from this set of files (say in an Iceberg or DeltaLake > table). Without checking all other files, we can't know what the new max > is, but we know it is less or equal to 8. So we could update the > statistics, keeping 8 in there but marking it as not exact, as it indeed > might not be exact. > > But now, what does this mean in context of Parquet. Parquet, as of now, > doesn't have statistics spanning multiple row groups, so the example I made > above doesn't appear in Parquet. You could make the same case for pages > inside a row group (i.e., if one page gets "deleted", then you could update > the column chunk statistics, setting the min & max as not exact), but > selective deletion of pages is a weird use case, so probably nothing that > needs to be considered. > > Another case why you might want to have non-tight bounds is if you don't > want to compute them but derive them from existing statistics that were > already labeled as non-exact. Say you "compact" a whole Iceberg into a > single Parquet row group and you don't want to compute the min/max but > instead want to take the ones from the Iceberg and these were labeled as > non-exact. This use case is pretty dubious I admit. Computing a min & max > when you're touching the data anyway is cheap enough to just do it. > > So in conclusion, I don't see a compelling use case why someone would want > non-exact bounds on fixed size types in Parquet. But semantically speaking, > they do make sense also for fixed size types. > > Cheers, > Jan > > > Am Do., 5. Sept. 2024 um 09:19 Uhr schrieb Gábor Szádovszky < > ga...@apache.org>: > > > Hi Xuwei, > > > > There is no "exact" flag from page index because the values there are > "not > > exact" by design. See "observations" at [1]. > > > > 1. I think we should be more precise in the spec. It would not make sense > > to truncate 32 or 64 bit values and it won't be compatible with existing > > implementations either. So, I would say, the "exact" flags would only be > > meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types. For any other > > types the flags should not be used and would mean "exact". > > 2. Since we already have releases that may produce truncation (see [2]) > > without having the related flags in the format, we shall not handle > min/max > > values as exact without the flags. If the flags are not present, we shall > > handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially > truncated. > > > > Cheers, > > Gabor > > > > [1] > > > > > https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach > > [2] https://issues.apache.org/jira/browse/PARQUET-1685 > > > > wish maple <maplewish...@gmail.com> ezt írta (időpont: 2024. szept. 5., > > Cs, > > 5:18): > > > > > Currently, Parquet-spec[1] and implementations in parquet-java[2], > > > parquet-rs[3] allows truncation in Parquet statistics. The statistics > > > truncation might happens in ColumnChunk level statistics, page level > > > statistics and Page Index. > > > > > > Currently, the truncate in [2][3] follows the underlying rule: > > > > > > In arrow-rs[2]: > > > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be truncated > > > 2. The truncated utf-8 should also be utf-8. > > > > > > In parquet-java [3][4]. The writer would maintains a "truncate-length", > > and > > > String type would > > > be truncate to this length. > > > > > > Currently, in public parquet-format spec[1], we have > > `is_{min|max}_exact`, > > > but it's only in > > > `Statistics`, and not in PageIndex. > > > > > > So, when consuming a Statistics: > > > 1. Can Int32/Int64/Float be statistics decided "exact" if it exists, > even > > > if Statistics.{min|max}_exact is not set? > > > 2. Should string/flba statistics regarded as "in-exact" if > > > Statistics.{min|max}_exact is not set? > > > > > > Best, > > > Xuwei Fu > > > > > > [1] > > > > > > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288 > > > [2] > > > > > > > > > https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837 > > > [3] > > > > > > > > > https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182 > > > [4] > > > > > > > > > https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116 > > > > > >