The optimizations here are not flatbuffer-specific. There are some nuances between flatbuffers and Thrift, but the issue is at a higher level.
The discussion has helped me with the design. I kept the semantics the same as in the current metadata, but made the case where strings are inexact much more compact.

On Wed, Sep 11, 2024 at 9:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Apologies for a little bit of a thread hi-jack:
>
> > > On one hand we handle the worst case in a very fast way but pessimize every other case.
> >
> > On the other hand, 1 byte per row group is also absolutely negligible, unless you have super wide schemas and/or super short / sparse row groups. How often does this happen? I guess you will find use cases where a very wide schema is common and you will find use cases where needing min/max pushdown is super common. You will find cases where one of the two is absolutely useless, while the other is important.
>
> I think this point is pertinent to the metadata discussion we are having. If adding N bytes that help certain use-cases a lot is a problem (really, in practice, if we are concerned about this particular use-case it should be bits per column), then I think we should be thinking hard about whether the new metadata design is flexible enough. In some ways it seems like there are two orthogonal concerns that need to be addressed:
>
> 1. Efficiency of the serialization format in terms of speed/size.
> 2. Efficiency of the data layout, so that common cases are not pessimized by I/O concerns: reading everything that is needed in a single operation vs. allowing a very small number of additional I/Os to drastically improve some use-cases.
>
> In short, I think we might want to further divide the current metadata into smaller units that are addressable separately, rather than simply translating/optimizing one monolithic Flatbuffer structure (this raises complexity, but it feels like it is potentially worth it).
>
> Thanks,
> Micah
>
> On Mon, Sep 9, 2024 at 2:14 PM Jan Finis <jpfi...@gmail.com> wrote:
>
> > > Another option is to keep the complexity, make inexact the default (and thus not pay bytes for it on the wire) and allow engines to emit exact BINARY if they so desire.
> >
> > I would argue that - for correctness - inexact *has to be* the default, as legacy writers will not write it. So you cannot assume that the absence of a recently introduced field means anything specific. If the information is missing, you have to be pessimistic, so you have to assume the worst. And you are right that this also solves the space problem, at least in Thrift. In Thrift, it is possible to just leave the field out and save space. I don't know whether this is possible in the proposed flatbuffers encoding. If it is possible there as well, then we're good. If it's not, then the default won't help us.
> >
> > > On one hand we handle the worst case in a very fast way but pessimize every other case.
> >
> > On the other hand, 1 byte per row group is also absolutely negligible, unless you have super wide schemas and/or super short / sparse row groups. How often does this happen? I guess you will find use cases where a very wide schema is common and you will find use cases where needing min/max pushdown is super common. You will find cases where one of the two is absolutely useless, while the other is important.
> >
> > > Intuitively for BINARY to have the same max across a large percentage of the row group population, I think we either have:
> >
> > Here the difference is indeed whether we mean *inexact* or *truncated*. I agree with you: for *truncated*, the chance will be high that a single row group is definitely the one with the max, as a truncation also gives us a lower bound on how imprecise the bound can be. If the bound is merely labeled *inexact*, it is not guaranteed that it is only truncated; it could also be just inexact, so there is nothing you can tell about how inexact the bound is.
> >
> > Note that currently the statistics field is indeed called "*exact*", not "*truncated*", so with the current wording, you don't have any guarantees. The max for a string column could say "zzz" but the real max could be "aaa". You might ask why a sane writer would do this. It probably wouldn't, but you cannot assume a writer to be sane. The spec - in its current wording - allows this, so you have to handle it gracefully.
> >
> > On Mon, Sep 9, 2024 at 10:16 PM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > > If your min/max are not exact, you need to scan all row groups in the worst case.
> > >
> > > True, except it is a tradeoff. On one hand we handle the worst case in a very fast way but pessimize every other case. Thus the probability of observing the worst case is important. This is why I said:
> > >
> > > > how often do we need that little bit extra and how much do we pay for all the cases where we do not need it.
> > >
> > > Thinking a bit more about this:
> > >
> > > For fixed-width values (ints, floats, etc.) we can be exact without cost. That leaves BINARY. To be exact for BINARY we have to pay a cost (one byte for exact/inexact plus the full string), or we chop it and are inexact. The question is how often we get close to the worst case and how much it matters.
> > >
> > > Intuitively, for BINARY to have the same max across a large percentage of the row group population, I think we either have:
> > > 1. A low number of values (not many row groups). I guess in that case it is not so bad; we scan it all.
> > > 2. Low NDV but lots of values. In this case the column is almost certainly dictionary encoded, and aggregate pushdown can be done on the dictionary without decoding the data pages. That's typically one extra fetch and a single (dictionary) page scan per row group.
> > >
> > > If my intuition is correct, perhaps the value of exact BINARY min/max is low enough that it is not necessary. I can see if this can be instrumented and backed by data.
> > >
> > > Another option is to keep the complexity, make inexact the default (and thus not pay bytes for it on the wire) and allow engines to emit exact BINARY if they so desire.
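
To make the pushdown tradeoff discussed above concrete, here is a minimal Python sketch of how a reader could pick the row groups that must be scanned to answer MAX(col) when maxima may be inexact. It assumes that an inexact max is still a valid (possibly very loose) upper bound, and it treats a missing exactness flag as inexact, as argued above. The names RowGroupStats, max_value and is_max_value_exact are illustrative stand-ins, not a reference to the actual Thrift or Flatbuffers schema.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class RowGroupStats:
        max_value: Optional[bytes]            # upper bound for the column in this row group
        is_max_value_exact: Optional[bool]    # None: legacy writer, flag absent -> assume inexact

    def row_groups_to_scan_for_max(groups: List[RowGroupStats]) -> List[int]:
        # Best bound that is known to be attained by an actual value in some row group.
        exact_maxes = [g.max_value for g in groups
                       if g.is_max_value_exact is True and g.max_value is not None]
        best_exact = max(exact_maxes) if exact_maxes else None

        to_scan = []
        for i, g in enumerate(groups):
            if g.max_value is None:
                to_scan.append(i)   # no statistics at all: must scan
            elif best_exact is None or g.max_value > best_exact:
                to_scan.append(i)   # no exact bound anywhere, or this bound could exceed it
            # else: true max of group i <= its bound <= best_exact, so it cannot change MAX
        return to_scan

    stats = [
        RowGroupStats(max_value=b"pear", is_max_value_exact=True),
        RowGroupStats(max_value=b"zzz", is_max_value_exact=False),   # loose bound
        RowGroupStats(max_value=b"fig", is_max_value_exact=None),    # legacy writer
    ]
    print(row_groups_to_scan_for_max(stats))   # [1]

The final answer is the maximum of best_exact and the values found in the scanned groups. If no bound is exact, best_exact stays None and every row group is scanned, which is exactly the worst case described in the thread.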
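The dictionary shortcut from point 2 can be sketched in the same spirit. The sketch below assumes every data page of the column chunk is dictionary-encoded (no fallback to plain encoding) and that every dictionary entry actually occurs in the data; under those assumptions MAX(col) can be computed from the decoded dictionary pages alone. max_via_dictionaries is a hypothetical helper, not an existing Parquet API, and the one extra fetch and decode per row group mentioned above is assumed to have happened already.

    from typing import Iterable, List, Optional

    def max_via_dictionaries(dictionaries: Iterable[List[bytes]]) -> Optional[bytes]:
        # Each element is the decoded dictionary page of one row group's column chunk.
        best: Optional[bytes] = None
        for dictionary in dictionaries:
            if dictionary:                      # skip row groups with an empty/absent dictionary
                candidate = max(dictionary)     # max over dictionary entries; data pages untouched
                if best is None or candidate > best:
                    best = candidate
        return best

    # e.g. three row groups, one decoded dictionary page each:
    print(max_via_dictionaries([[b"apple", b"pear"], [b"fig"], [b"pear", b"plum"]]))   # b'plum'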