Re: [DISCUSS] Clarify min-max truncation in Parquet statistics

Alkis Evlogimenos Mon, 09 Sep 2024 01:11:29 -0700

TIL thank you.

In the case of aggregate pushdown, can't an engine use inexact min/max? I
understand it will have to scan at least one row group. To be exact it will
have to scan all row groups that match the inexact min (or max) across all
row groups. If we consider that the optimization is interesting for largish
tables with thousands/millions of row groups (across many files) inexact
min/max will still allow us to skip as well as exact min/max (skipping
99.99% of all row groups will have same performance as skipping 100% of all
row groups). This seems to have a very similar tradeoff as filtering: we
pay an extra byte per min/max to gain slightly more filtering/aggregation
power. The question then is the same: how often do we need that little bit
extra and how much do we pay for all the cases where we do not need it.


On Sat, Sep 7, 2024 at 4:21 PM Gang Wu <ust...@gmail.com> wrote:

> +1 to Micah's response. The `is_min_exact` and `is_max_exact`
> flags were added to support aggregate pushdown from Trino.
>
> Best,
> Gang
>
> On Sat, Sep 7, 2024 at 1:17 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > >
> > > If we would do statistics again, could we specify it is unspecified if
> > they
> > > are exact or not - in other words the consumer needs to assume they are
> > > inexact. Anyone did any experiments to see what this means in practice?
> > How
> > > much filtering power do we lose?
> >
> >
> > I think the point of knowing if they are exact has less to do with
> > pruning/filtering power, and more to do with pushdown aggregates so one
> can
> > know if they can just use the statistics value provided or all values
> need
> > to be scanned.
> >
> > On Fri, Sep 6, 2024 at 8:25 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > If we would do statistics again, could we specify it is unspecified if
> > they
> > > are exact or not - in other words the consumer needs to assume they are
> > > inexact. Anyone did any experiments to see what this means in practice?
> > How
> > > much filtering power do we lose?
> > >
> > >
> > > On Fri, Sep 6, 2024 at 2:08 PM wish maple <maplewish...@gmail.com>
> > wrote:
> > >
> > > > Thanks Jan!
> > > >
> > > > So Actually marking a statistics as "exact" should be more exactly,
> > > > not only truncating bytes, but also allowing range of values to be
> > > > larger
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > Gábor Szádovszky <ga...@apache.org> 于2024年9月6日周五 19:54写道：
> > > >
> > > > > Thanks, Jan for the correction. I was stuck with the idea of
> > > > > truncation which naturally applies to the array types. As you
> > > explained,
> > > > > non-exact min/max values might make sense for any types (except
> > > boolean).
> > > > >
> > > > > Cheers,
> > > > > Gabor
> > > > >
> > > > > Jan Finis <jpfi...@gmail.com> ezt írta (időpont: 2024. szept. 5.,
> > Cs,
> > > > > 16:20):
> > > > >
> > > > > > Of course min & max can also be non-exact on ints, doubles, etc.
> > But
> > > > for
> > > > > > those types, being inexact doesn't mean being truncated. It just
> > > means
> > > > > > being "wider" than the true bounds. E.g., if the true max is 5
> and
> > > the
> > > > > max
> > > > > > in the statistics is 8, then it is non-exact, so the notion of
> not
> > > > being
> > > > > > exact also makes sense for fixed length types as well.
> > > > > >
> > > > > > The question is whether it would ever make sense to store a
> > non-tight
> > > > > bound
> > > > > > for fixed-length types.
> > > > > >
> > > > > > For formats that aggregate across multiple files or row groups,
> the
> > > > clear
> > > > > > answer is "yes"! This can happen when files / row groups were
> > > deleted.
> > > > > > E.g., say the max between a couple of Parquet files is 8. Now one
> > > > Parquet
> > > > > > file is deleted from this set of files (say in an Iceberg or
> > > DeltaLake
> > > > > > table). Without checking all other files, we can't know what the
> > new
> > > > max
> > > > > > is, but we know it is less or equal to 8. So we could update the
> > > > > > statistics, keeping 8 in there but marking it as not exact, as it
> > > > indeed
> > > > > > might not be exact.
> > > > > >
> > > > > > But now, what does this mean in context of Parquet. Parquet, as
> of
> > > now,
> > > > > > doesn't have statistics spanning multiple row groups, so the
> > example
> > > I
> > > > > made
> > > > > > above doesn't appear in Parquet. You could make the same case for
> > > pages
> > > > > > inside a row group (i.e., if one page gets "deleted", then you
> > could
> > > > > update
> > > > > > the column chunk statistics, setting the min & max as not exact),
> > but
> > > > > > selective deletion of pages is a weird use case, so probably
> > nothing
> > > > that
> > > > > > needs to be considered.
> > > > > >
> > > > > > Another case why you might want to have non-tight bounds is if
> you
> > > > don't
> > > > > > want to compute them but derive them from existing statistics
> that
> > > were
> > > > > > already labeled as non-exact. Say you "compact" a whole Iceberg
> > into
> > > a
> > > > > > single Parquet row group and you don't want to compute the
> min/max
> > > but
> > > > > > instead want to take the ones from the Iceberg and these were
> > labeled
> > > > as
> > > > > > non-exact. This use case is pretty dubious I admit. Computing a
> > min &
> > > > max
> > > > > > when you're touching the data anyway is cheap enough to just do
> it.
> > > > > >
> > > > > > So in conclusion, I don't see a compelling use case why someone
> > would
> > > > > want
> > > > > > non-exact bounds on fixed size types in Parquet. But semantically
> > > > > speaking,
> > > > > > they do make sense also for fixed size types.
> > > > > >
> > > > > > Cheers,
> > > > > > Jan
> > > > > >
> > > > > >
> > > > > > Am Do., 5. Sept. 2024 um 09:19 Uhr schrieb Gábor Szádovszky <
> > > > > > ga...@apache.org>:
> > > > > >
> > > > > > > Hi Xuwei,
> > > > > > >
> > > > > > > There is no "exact" flag from page index because the values
> there
> > > are
> > > > > > "not
> > > > > > > exact" by design. See "observations" at [1].
> > > > > > >
> > > > > > > 1. I think we should be more precise in the spec. It would not
> > make
> > > > > sense
> > > > > > > to truncate 32 or 64 bit values and it won't be compatible with
> > > > > existing
> > > > > > > implementations either. So, I would say, the "exact" flags
> would
> > > only
> > > > > be
> > > > > > > meaningful for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types. For
> any
> > > > other
> > > > > > > types the flags should not be used and would mean "exact".
> > > > > > > 2. Since we already have releases that may produce truncation
> > (see
> > > > [2])
> > > > > > > without having the related flags in the format, we shall not
> > handle
> > > > > > min/max
> > > > > > > values as exact without the flags. If the flags are not
> present,
> > we
> > > > > shall
> > > > > > > handle BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types as potentially
> > > > > > truncated.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gabor
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
> > > > > > > [2] https://issues.apache.org/jira/browse/PARQUET-1685
> > > > > > >
> > > > > > > wish maple <maplewish...@gmail.com> ezt írta (időpont: 2024.
> > > szept.
> > > > > 5.,
> > > > > > > Cs,
> > > > > > > 5:18):
> > > > > > >
> > > > > > > > Currently, Parquet-spec[1] and implementations in
> > > parquet-java[2],
> > > > > > > > parquet-rs[3] allows truncation in Parquet statistics. The
> > > > statistics
> > > > > > > > truncation might happens in ColumnChunk level statistics,
> page
> > > > level
> > > > > > > > statistics and Page Index.
> > > > > > > >
> > > > > > > > Currently, the truncate in [2][3] follows the underlying
> rule:
> > > > > > > >
> > > > > > > > In arrow-rs[2]:
> > > > > > > > 1. Only BYTE_ARRAY and (non-decimal/f16) FLBA can be
> truncated
> > > > > > > > 2. The truncated utf-8 should also be utf-8.
> > > > > > > >
> > > > > > > > In parquet-java [3][4]. The writer would maintains a
> > > > > "truncate-length",
> > > > > > > and
> > > > > > > > String type would
> > > > > > > > be truncate to this length.
> > > > > > > >
> > > > > > > > Currently, in public parquet-format spec[1], we have
> > > > > > > `is_{min|max}_exact`,
> > > > > > > > but it's only in
> > > > > > > > `Statistics`, and not in PageIndex.
> > > > > > > >
> > > > > > > > So, when consuming a Statistics:
> > > > > > > > 1. Can Int32/Int64/Float be statistics decided "exact" if it
> > > > exists,
> > > > > > even
> > > > > > > > if Statistics.{min|max}_exact is not set?
> > > > > > > > 2. Should string/flba statistics regarded as "in-exact" if
> > > > > > > > Statistics.{min|max}_exact is not set?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Xuwei Fu
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L285-L288
> > > > > > > > [2]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/efe867a5a202f03846d8b6c737cb62ff16054940/parquet/src/column/writer/mod.rs#L837
> > > > > > > > [3]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryTruncator.java#L182
> > > > > > > > [4]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/aec7bc64dffa373db678ab2fc8b46565b4c011a5/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L116
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Clarify min-max truncation in Parquet statistics

Reply via email to