Thanks Jan! So actually, what it means to mark a statistic as "exact" should be defined more precisely: a non-exact value is not only one whose bytes were truncated, but also one that allows the range of values to be larger than the true bounds.

Best,
Xuwei Fu

Gábor Szádovszky <ga...@apache.org> wrote on Fri, Sep 6, 2024 at 19:54:

> Thanks, Jan, for the correction. I was stuck with the idea of
> truncation, which naturally applies to the array types. As you explained,
> non-exact min/max values might make sense for any type (except boolean).
>
> Cheers,
> Gabor
>
> Jan Finis <jpfi...@gmail.com> wrote on Thu, Sep 5, 2024 at 16:20:
>
> > Of course min & max can also be non-exact on ints, doubles, etc. But for
> > those types, being inexact doesn't mean being truncated. It just means
> > being "wider" than the true bounds. E.g., if the true max is 5 and the
> > max in the statistics is 8, then it is non-exact, so the notion of not
> > being exact also makes sense for fixed-length types.
> >
> > The question is whether it would ever make sense to store a non-tight
> > bound for fixed-length types.
> >
> > For formats that aggregate across multiple files or row groups, the clear
> > answer is "yes"! This can happen when files / row groups were deleted.
> > E.g., say the max across a couple of Parquet files is 8. Now one Parquet
> > file is deleted from this set of files (say in an Iceberg or Delta Lake
> > table). Without checking all other files, we can't know what the new max
> > is, but we know it is less than or equal to 8. So we could update the
> > statistics, keeping 8 in there but marking it as not exact, as it indeed
> > might not be exact.
> >
> > But now, what does this mean in the context of Parquet? Parquet, as of
> > now, doesn't have statistics spanning multiple row groups, so the example
> > I made above doesn't appear in Parquet. You could make the same case for
> > pages inside a row group (i.e., if one page gets "deleted", then you could
> > update the column chunk statistics, setting the min & max as not exact),
> > but selective deletion of pages is a weird use case, so probably nothing
> > that needs to be considered.
> >
> > Another case where you might want to have non-tight bounds is if you
> > don't want to compute them but derive them from existing statistics that
> > were already labeled as non-exact. Say you "compact" a whole Iceberg table
> > into a single Parquet row group and you don't want to compute the min/max
> > but instead want to take the ones from Iceberg, and these were labeled as
> > non-exact. This use case is pretty dubious, I admit. Computing a min & max
> > when you're touching the data anyway is cheap enough to just do it.
> >
> > So in conclusion, I don't see a compelling use case why someone would
> > want non-exact bounds on fixed-size types in Parquet. But semantically
> > speaking, they do make sense for fixed-size types as well.
> >
> > Cheers,
> > Jan
> >
> > On Thu, Sep 5, 2024 at 09:19, Gábor Szádovszky <ga...@apache.org> wrote:
> >
> > > Hi Xuwei,
> > >
> > > There is no "exact" flag in the page index because the values there are
> > > "not exact" by design. See "observations" at [1].
> > >
> > > 1. I think we should be more precise in the spec. It would not make
> > > sense to truncate 32 or 64 bit values, and it won't be compatible with
> > > existing implementations either. So, I would say, the "exact" flags
> > > would only be meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types.
> > > For any other types the flags should not be used and would mean "exact".
> > > 2. Since we already have releases that may produce truncation (see [2])
> > > without having the related flags in the format, we shall not handle
> > > min/max values as exact without the flags. If the flags are not present,
> > > we shall handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially
> > > truncated.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
> > > [2] https://issues.apache.org/jira/browse/PARQUET-1685
> > >
> > > wish maple <maplewish...@gmail.com> wrote on Thu, Sep 5, 2024 at 5:18:
> > >
> > > > Currently, the Parquet spec [1] and the implementations in
> > > > parquet-rs [2] and parquet-java [3][4] allow truncation in Parquet
> > > > statistics. The truncation might happen in ColumnChunk-level
> > > > statistics, page-level statistics, and the Page Index.
> > > >
> > > > Currently, truncation in these implementations follows these rules:
> > > >
> > > > In arrow-rs [2]:
> > > > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be truncated.
> > > > 2. Truncated UTF-8 must still be valid UTF-8.
> > > >
> > > > In parquet-java [3][4], the writer maintains a "truncate-length", and
> > > > String values are truncated to this length.
> > > >
> > > > Currently, in the public parquet-format spec [1], we have
> > > > `is_{min|max}_value_exact`, but it's only in `Statistics`, and not in
> > > > PageIndex.
> > > >
> > > > So, when consuming a Statistics:
> > > > 1. Can Int32/Int64/Float statistics be considered "exact" if they
> > > > exist, even if Statistics.is_{min|max}_value_exact is not set?
> > > > 2. Should string/FLBA statistics be regarded as "inexact" if
> > > > Statistics.is_{min|max}_value_exact is not set?
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > [1]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288
> > > > [2]
> > > > https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837
> > > > [3]
> > > > https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182
> > > > [4]
> > > > https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116
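
For reference, the reader-side rule discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not code from parquet-java or arrow-rs: the function name `bound_is_exact`, the string type names, and the `Optional[bool]` flag representation are all made up for the example. It honors an explicit flag for any physical type (following Jan's point that non-exact bounds make sense semantically for all types) and, when the flag is absent, treats only BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY bounds as potentially truncated (following Gabor's point 2).

```python
from typing import Optional

# Physical types whose min/max may have been truncated by older writers
# that did not yet emit the exactness flags (assumption taken from the thread).
BINARY_TYPES = {"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"}


def bound_is_exact(physical_type: str, exact_flag: Optional[bool]) -> bool:
    """Decide whether a min or max bound may be treated as exact.

    `exact_flag` stands in for an optional exactness flag on the statistics;
    None means the writer did not set it. Hypothetical helper, not a real API.
    """
    if exact_flag is not None:
        # An explicit flag wins, whatever the physical type is.
        return exact_flag
    # No flag: binary types may be truncated, other types are assumed exact.
    return physical_type not in BINARY_TYPES


if __name__ == "__main__":
    for ptype, flag in [("INT32", None), ("BYTE_ARRAY", None),
                        ("BYTE_ARRAY", True), ("INT64", False)]:
        print(ptype, flag, bound_is_exact(ptype, flag))
```

Note that a non-exact bound is still a valid bound, so it can still be used to skip data during filtering; it just cannot be returned as the actual min or max of the column.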