Thanks, Jan, for the correction. I was stuck on the idea of
truncation, which naturally applies to the array types. As you explained,
non-exact min/max values might make sense for any type (except boolean).
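
To make the reading rule from this thread concrete, here is a minimal Python
sketch of how a reader might decide whether a stored bound can be treated as
exact. It is only an illustration under assumptions, not what the spec or any
implementation does: `bound_is_exact` and `BINARY_TYPES` are hypothetical
names, and the flag argument stands in for the thread's `is_{min|max}_exact`
shorthand. An explicit flag is honored for every type; when the flag is
absent, only BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY bounds are treated as
potentially truncated.

```python
from typing import Optional

# Physical types for which truncation is a natural source of inexactness.
BINARY_TYPES = {"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"}


def bound_is_exact(physical_type: str,
                   value: Optional[bytes],
                   flag: Optional[bool]) -> bool:
    """Return True only if the bound may be treated as the true min/max.

    `flag` is None when the writer did not set the exactness flag at all
    (hypothetical modelling of an unset optional field).
    """
    if value is None:
        return False  # no bound stored, so nothing can be exact
    if flag is not None:
        return flag   # an explicit flag wins, whatever the type
    # Flag absent: older writers may have truncated binary values
    # (see PARQUET-1685), so only non-binary types default to exact.
    return physical_type not in BINARY_TYPES


# Usage: an INT32 max without a flag is taken as exact, a BYTE_ARRAY max
# without a flag is not.
int_max_exact = bound_is_exact("INT32", b"\x05\x00\x00\x00", None)
str_max_exact = bound_is_exact("BYTE_ARRAY", b"abc", None)
```

The default for a missing flag is the only place where the physical type
matters in this sketch; a writer that sets the flag explicitly is trusted for
ints and doubles as well.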

Cheers,
Gabor

Jan Finis <jpfi...@gmail.com> wrote (Thu, Sept 5, 2024, 16:20):

> Of course min & max can also be non-exact on ints, doubles, etc. But for
> those types, being inexact doesn't mean being truncated. It just means
> being "wider" than the true bounds. E.g., if the true max is 5 and the max
> in the statistics is 8, then it is non-exact. So the notion of not being
> exact makes sense for fixed-length types as well.
>
> The question is whether it would ever make sense to store a non-tight bound
> for fixed-length types.
>
> For formats that aggregate across multiple files or row groups, the clear
> answer is "yes"! This can happen when files / row groups were deleted.
> E.g., say the max between a couple of Parquet files is 8. Now one Parquet
> file is deleted from this set of files (say in an Iceberg or DeltaLake
> table). Without checking all other files, we can't know what the new max
> is, but we know it is less than or equal to 8. So we could update the
> statistics, keeping 8 in there but marking it as not exact, as it indeed
> might not be exact.
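
The deletion example above can be sketched in a few lines of Python. This is
a hedged illustration only: `FileStats`, `TableStats` and `after_delete` are
hypothetical names, not part of Iceberg, Delta Lake or Parquet, and the
shortcut for files whose bound lies strictly below the aggregate is an added
assumption that goes slightly beyond the example.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file upper bound for one column."""
    max_value: int


@dataclass
class TableStats:
    """Hypothetical aggregate over a set of files."""
    max_value: int
    is_max_exact: bool


def after_delete(table: TableStats, deleted: FileStats) -> TableStats:
    """Update the aggregate max without rescanning the remaining files."""
    if deleted.max_value < table.max_value:
        # The deleted file cannot have held the aggregate max, so the
        # bound and its exactness are unchanged.
        return table
    # The deleted file may have held the max. The old value is still a
    # valid upper bound, but it can no longer be claimed to be exact.
    return TableStats(table.max_value, False)


table = TableStats(max_value=8, is_max_exact=True)
kept = after_delete(table, FileStats(max_value=5))
widened = after_delete(table, FileStats(max_value=8))
```

Here `kept` still reports an exact max of 8, while `widened` keeps 8 as the
bound but marks it as not exact.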
>
> But now, what does this mean in the context of Parquet? Parquet, as of now,
> doesn't have statistics spanning multiple row groups, so the example I made
> above doesn't appear in Parquet. You could make the same case for pages
> inside a row group (i.e., if one page gets "deleted", then you could update
> the column chunk statistics, setting the min & max as not exact), but
> selective deletion of pages is a weird use case, so probably nothing that
> needs to be considered.
>
> Another reason you might want non-tight bounds is if you don't
> want to compute them but derive them from existing statistics that were
> already labeled as non-exact. Say you "compact" a whole Iceberg table into a
> single Parquet row group and you don't want to compute the min/max but
> instead want to take the ones from the Iceberg table, and these were labeled
> as non-exact. This use case is pretty dubious, I admit. Computing a min & max
> when you're touching the data anyway is cheap enough to just do it.
>
> So in conclusion, I don't see a compelling reason why someone would want
> non-exact bounds on fixed-size types in Parquet. But semantically speaking,
> they do make sense for fixed-size types as well.
>
> Cheers,
> Jan
>
>
> On Thu, Sept 5, 2024 at 09:19, Gábor Szádovszky <ga...@apache.org> wrote:
>
> > Hi Xuwei,
> >
> > There is no "exact" flag in the page index because the values there are
> > "not exact" by design. See "observations" at [1].
> >
> > 1. I think we should be more precise in the spec. It would not make sense
> > to truncate 32- or 64-bit values, and it would not be compatible with
> > existing implementations either. So, I would say, the "exact" flags would
> > only be meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types. For any
> > other types the flags should not be used, and the values would be
> > considered "exact".
> > 2. Since we already have releases that may produce truncation (see [2])
> > without having the related flags in the format, we shall not treat min/max
> > values as exact without the flags. If the flags are not present, we shall
> > handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially truncated.
> >
> > Cheers,
> > Gabor
> >
> > [1] https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
> > [2] https://issues.apache.org/jira/browse/PARQUET-1685
> >
> > wish maple <maplewish...@gmail.com> wrote (Thu, Sept 5, 2024, 5:18):
> >
> > > Currently, the Parquet spec[1] and the implementations in arrow-rs[2]
> > > and parquet-java[3][4] allow truncation in Parquet statistics. Statistics
> > > truncation might happen in ColumnChunk-level statistics, page-level
> > > statistics and the Page Index.
> > >
> > > Currently, truncation in [2][3] follows these rules:
> > >
> > > In arrow-rs[2]:
> > > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be truncated.
> > > 2. A truncated UTF-8 value should still be valid UTF-8.
> > >
> > > In parquet-java[3][4], the writer maintains a "truncate-length", and
> > > String values are truncated to this length.
> > >
> > > Currently, in the public parquet-format spec[1], we have
> > > `is_{min|max}_exact`, but only in `Statistics`, not in the PageIndex.
> > >
> > > So, when consuming a Statistics:
> > > 1. Can Int32/Int64/Float statistics be considered "exact" if they exist,
> > > even if Statistics.{min|max}_exact is not set?
> > > 2. Should string/FLBA statistics be regarded as "inexact" if
> > > Statistics.{min|max}_exact is not set?
> > >
> > > Best,
> > > Xuwei Fu
> > >
> > > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288
> > > [2] https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837
> > > [3] https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182
> > > [4] https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116
> > >
> >
>
