Thanks Jan!

So Actually marking a statistics as "exact" should be more exactly,
not only truncating bytes, but also allowing range of values to be
larger

Best,
Xuwei Fu

Gábor Szádovszky <ga...@apache.org> 于2024年9月6日周五 19:54写道:

> Thanks, Jan for the correction. I was stuck with the idea of
> truncation which naturally applies to the array types. As you explained,
> non-exact min/max values might make sense for any types (except boolean).
>
> Cheers,
> Gabor
>
> Jan Finis <jpfi...@gmail.com> ezt írta (időpont: 2024. szept. 5., Cs,
> 16:20):
>
> > Of course min & max can also be non-exact on ints, doubles, etc. But for
> > those types, being inexact doesn't mean being truncated. It just means
> > being "wider" than the true bounds. E.g., if the true max is 5 and the
> max
> > in the statistics is 8, then it is non-exact, so the notion of not being
> > exact also makes sense for fixed length types as well.
> >
> > The question is whether it would ever make sense to store a non-tight
> bound
> > for fixed-length types.
> >
> > For formats that aggregate across multiple files or row groups, the clear
> > answer is "yes"! This can happen when files / row groups were deleted.
> > E.g., say the max between a couple of Parquet files is 8. Now one Parquet
> > file is deleted from this set of files (say in an Iceberg or DeltaLake
> > table). Without checking all other files, we can't know what the new max
> > is, but we know it is less or equal to 8. So we could update the
> > statistics, keeping 8 in there but marking it as not exact, as it indeed
> > might not be exact.
> >
> > But now, what does this mean in context of Parquet. Parquet, as of now,
> > doesn't have statistics spanning multiple row groups, so the example I
> made
> > above doesn't appear in Parquet. You could make the same case for pages
> > inside a row group (i.e., if one page gets "deleted", then you could
> update
> > the column chunk statistics, setting the min & max as not exact), but
> > selective deletion of pages is a weird use case, so probably nothing that
> > needs to be considered.
> >
> > Another case why you might want to have non-tight bounds is if you don't
> > want to compute them but derive them from existing statistics that were
> > already labeled as non-exact. Say you "compact" a whole Iceberg into a
> > single Parquet row group and you don't want to compute the min/max but
> > instead want to take the ones from the Iceberg and these were labeled as
> > non-exact. This use case is pretty dubious I admit. Computing a min & max
> > when you're touching the data anyway is cheap enough to just do it.
> >
> > So in conclusion, I don't see a compelling use case why someone would
> want
> > non-exact bounds on fixed size types in Parquet. But semantically
> speaking,
> > they do make sense also for fixed size types.
> >
> > Cheers,
> > Jan
> >
> >
> > Am Do., 5. Sept. 2024 um 09:19 Uhr schrieb Gábor Szádovszky <
> > ga...@apache.org>:
> >
> > > Hi Xuwei,
> > >
> > > There is no "exact" flag from page index because the values there are
> > "not
> > > exact" by design. See "observations" at [1].
> > >
> > > 1. I think we should be more precise in the spec. It would not make
> sense
> > > to truncate 32 or 64 bit values and it won't be compatible with
> existing
> > > implementations either. So, I would say, the "exact" flags would only
> be
> > > meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types. For any other
> > > types the flags should not be used and would mean "exact".
> > > 2. Since we already have releases that may produce truncation (see [2])
> > > without having the related flags in the format, we shall not handle
> > min/max
> > > values as exact without the flags. If the flags are not present, we
> shall
> > > handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially
> > truncated.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
> > > [2] https://issues.apache.org/jira/browse/PARQUET-1685
> > >
> > > wish maple <maplewish...@gmail.com> ezt írta (időpont: 2024. szept.
> 5.,
> > > Cs,
> > > 5:18):
> > >
> > > > Currently, Parquet-spec[1] and implementations in parquet-java[2],
> > > > parquet-rs[3] allows truncation in Parquet statistics. The statistics
> > > > truncation might happens in ColumnChunk level statistics, page level
> > > > statistics and Page Index.
> > > >
> > > > Currently, the truncate in [2][3] follows the underlying rule:
> > > >
> > > > In arrow-rs[2]:
> > > > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be truncated
> > > > 2. The truncated utf-8 should also be utf-8.
> > > >
> > > > In parquet-java [3][4]. The writer would maintains a
> "truncate-length",
> > > and
> > > > String type would
> > > > be truncate to this length.
> > > >
> > > > Currently, in public parquet-format spec[1], we have
> > > `is_{min|max}_exact`,
> > > > but it's only in
> > > > `Statistics`, and not in PageIndex.
> > > >
> > > > So, when consuming a Statistics:
> > > > 1. Can Int32/Int64/Float be statistics decided "exact" if it exists,
> > even
> > > > if Statistics.{min|max}_exact is not set?
> > > > 2. Should string/flba statistics regarded as "in-exact" if
> > > > Statistics.{min|max}_exact is not set?
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288
> > > > [2]
> > > >
> > > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837
> > > > [3]
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182
> > > > [4]
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116
> > > >
> > >
> >
>

Reply via email to