Of course, min & max can also be non-exact on ints, doubles, etc. But for
those types, being inexact doesn't mean being truncated. It just means
being "wider" than the true bounds. E.g., if the true max is 5 and the max
in the statistics is 8, then the statistic is a valid but non-exact bound.
So the notion of not being exact makes sense for fixed-length types as
well.
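
To illustrate why a "wider" bound is still safe to use for filtering, here
is a minimal sketch. The function name `may_contain` is made up for this
example and is not part of any Parquet library:

```python
def may_contain(stat_min: int, stat_max: int, value: int) -> bool:
    """Return False only if the statistics prove that `value` is absent."""
    return stat_min <= value <= stat_max


# Hypothetical column data: the true max is 5, but the stored max is 8.
values = [1, 3, 5]
stat_min, stat_max = 1, 8

# 7 is not in the data, but the wide bound cannot rule it out. The reader
# scans the data needlessly, which costs time but is still correct.
needs_scan_for_7 = may_contain(stat_min, stat_max, 7)

# 9 lies outside even the wide bound, so the data can be skipped safely.
needs_scan_for_9 = may_contain(stat_min, stat_max, 9)
```

A non-exact bound only costs pruning efficiency. It never causes a value
that is present to be skipped, as long as the bound is not narrower than
the true one.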

The question is whether it would ever make sense to store a non-tight bound
for fixed-length types.

For formats that aggregate across multiple files or row groups, the clear
answer is "yes"! This can happen when files / row groups are deleted.
E.g., say the max across a couple of Parquet files is 8. Now one Parquet
file is deleted from this set of files (say in an Iceberg or Delta Lake
table). Without checking all the other files, we can't know what the new
max is, but we know it is less than or equal to 8. So we could update the
statistics, keeping 8 in there but marking it as not exact, as it indeed
might not be exact.
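
As a rough sketch of that update (the `AggStats` type and `after_delete`
function are invented for illustration, not taken from Iceberg, Delta Lake,
or Parquet), the aggregate keeps its old bounds and only the exactness
flags change. The sketch also assumes we know the deleted file's own
min/max, which lets us keep a flag set when the deleted file clearly did
not hold the extreme value:

```python
from dataclasses import dataclass


@dataclass
class AggStats:
    """Hypothetical table-level statistics spanning several files."""
    min_value: int
    max_value: int
    is_min_exact: bool = True
    is_max_exact: bool = True


def after_delete(agg: AggStats, file_min: int, file_max: int) -> AggStats:
    """Update aggregate stats after removing one file, without rescanning."""
    # If the deleted file's bound lies strictly inside the aggregate bound,
    # the extreme value must live in some other file, so the flag survives.
    # Otherwise the old bound is still valid but may no longer be tight.
    min_exact = agg.is_min_exact and file_min > agg.min_value
    max_exact = agg.is_max_exact and file_max < agg.max_value
    return AggStats(agg.min_value, agg.max_value, min_exact, max_exact)


# The aggregate max is 8 and the deleted file also had max 8: keep 8,
# but mark it as not exact.
updated = after_delete(AggStats(min_value=1, max_value=8), file_min=2, file_max=8)
```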

But what does this mean in the context of Parquet? Parquet, as of now,
doesn't have statistics spanning multiple row groups, so the example I gave
above can't occur in Parquet. You could make the same case for pages
inside a row group (i.e., if one page gets "deleted", then you could update
the column chunk statistics, marking the min & max as not exact), but
selective deletion of pages is a weird use case, so it's probably nothing
that needs to be considered.

Another reason you might want non-tight bounds is if you don't want to
compute them but instead derive them from existing statistics that were
already labeled as non-exact. Say you "compact" a whole Iceberg table into
a single Parquet row group and you don't want to compute the min/max but
instead take the ones from the Iceberg metadata, and these were labeled as
non-exact. This use case is pretty dubious, I admit. Computing a min & max
when you're touching the data anyway is cheap enough to just do it.
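
For completeness, here is a small sketch of what deriving a max from
existing statistics could look like. The helper `merge_max` is made up for
this example. Each input is a (max, is_exact) pair, and the list is
assumed to be non-empty:

```python
from typing import Iterable, Tuple


def merge_max(parts: Iterable[Tuple[int, bool]]) -> Tuple[int, bool]:
    """Combine per-source (max, is_exact) pairs into one (max, is_exact)."""
    parts = list(parts)
    # The largest source bound is a valid bound for the union of all sources.
    bound = max(value for value, _ in parts)
    # The merged bound is exact only if some source that supplies this bound
    # is itself exact, i.e. the value is known to really occur in the data.
    exact = any(is_exact for value, is_exact in parts if value == bound)
    return bound, exact


# One source has an exact max of 5, another a non-exact max of 8. The
# derived max is 8, and it inherits the non-exact label.
merged = merge_max([(5, True), (8, False)])
```

The non-exact label propagates whenever the winning bound came from a
non-exact source, which is exactly the situation where recomputing from
the data would give a tighter answer.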

So in conclusion, I don't see a compelling use case for non-exact bounds
on fixed-length types in Parquet. But semantically speaking, they do make
sense for fixed-length types too.

Cheers,
Jan


On Thu, Sep 5, 2024 at 09:19, Gábor Szádovszky <ga...@apache.org> wrote:

> Hi Xuwei,
>
> There is no "exact" flag from page index because the values there are "not
> exact" by design. See "observations" at [1].
>
> 1. I think we should be more precise in the spec. It would not make sense
> to truncate 32 or 64 bit values and it won't be compatible with existing
> implementations either. So, I would say, the "exact" flags would only be
> meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types. For any other
> types the flags should not be used and would mean "exact".
> 2. Since we already have releases that may produce truncation (see [2])
> without having the related flags in the format, we shall not handle min/max
> values as exact without the flags. If the flags are not present, we shall
> handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially truncated.
>
> Cheers,
> Gabor
>
> [1] https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
> [2] https://issues.apache.org/jira/browse/PARQUET-1685
>
> On Thu, Sep 5, 2024 at 5:18, wish maple <maplewish...@gmail.com> wrote:
>
> > Currently, the Parquet spec[1] and the implementations in arrow-rs[2]
> > and parquet-java[3][4] allow truncation in Parquet statistics. The
> > truncation might happen in ColumnChunk-level statistics, page-level
> > statistics, and the Page Index.
> >
> > Currently, the truncation in [2][3] follows these rules:
> >
> > In arrow-rs[2]:
> > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be truncated.
> > 2. A truncated UTF-8 value must still be valid UTF-8.
> >
> > In parquet-java[3][4], the writer maintains a "truncate-length", and
> > String values are truncated to this length.
> >
> > Currently, in the public parquet-format spec[1], we have
> > `is_{min|max}_exact`, but it's only in `Statistics`, and not in
> > PageIndex.
> >
> > So, when consuming a Statistics:
> > 1. Can Int32/Int64/Float statistics be considered "exact" if they
> > exist, even if Statistics.{min|max}_exact is not set?
> > 2. Should string/FLBA statistics be regarded as "inexact" if
> > Statistics.{min|max}_exact is not set?
> >
> > Best,
> > Xuwei Fu
> >
> > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288
> > [2] https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837
> > [3] https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182
> > [4] https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116
> >
>
