Apologies for a little bit of a thread hijack:
> > On one hand we handle the worst case in a very fast way but pessimize
> > every other case.
>
> On the other hand, 1 byte per row group is also absolutely negligible,
> unless you have super wide schemas and/or super short / sparse row groups.
> How often does this happen? I guess you will find use cases where a very
> wide schema is common, and you will find use cases where needing min/max
> pushdown is super common. You will find cases where one of the two is
> absolutely useless, while the other is important.

I think this point is pertinent to the metadata discussion we are having.
If adding N bytes that help certain use cases a lot is a problem (really,
in practice, for this particular use case it should be bits per column),
then I think we should be thinking hard about whether the new metadata
design is flexible enough.

In some ways it seems like there are two orthogonal concerns that need to
be addressed:
1. Efficiency of the serialization format in terms of speed/size.
2. Efficiency of the data layout: not pessimizing common cases, where
everything a query needs should be readable in a single I/O operation,
while still allowing a very small number of optional additional I/Os that
drastically improve some use cases.

In short, I think we might want to further divide the current metadata
into smaller units that are addressable separately, rather than trying to
simply translate/optimize one monolithic Flatbuffers structure (this
raises complexity, but it feels like it is potentially worth it).
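To make the second concern concrete, here is a rough reader-side sketch of
what "smaller, separately addressable units" could mean. Everything in it
is hypothetical (the unit names, the footer layout); the point is only
that the footer would carry references to independently decodable
segments, so the common case stays a small read and optional metadata
costs at most one extra I/O:

from dataclasses import dataclass
from typing import BinaryIO, Optional, Tuple

# Hypothetical layout, purely illustrative -- not a proposal for the
# actual encoding of the Parquet footer.

@dataclass
class SegmentRef:
    offset: int  # byte offset of an independently decodable metadata unit
    length: int  # encoded length of that unit

@dataclass
class Footer:
    schema: SegmentRef         # always needed to plan a query
    row_groups: SegmentRef     # row group locations, always needed
    min_max_stats: SegmentRef  # optional: fetched only by pruning queries

def read_segment(f: BinaryIO, ref: SegmentRef) -> bytes:
    f.seek(ref.offset)
    return f.read(ref.length)

def load_metadata(
    f: BinaryIO, footer: Footer, needs_pruning: bool
) -> Tuple[bytes, bytes, Optional[bytes]]:
    # Common case: two small reads (or one, if the units are laid out
    # adjacently) cover everything a non-pruning query needs, so such
    # queries pay nothing for statistics they will never look at.
    schema = read_segment(f, footer.schema)
    row_groups = read_segment(f, footer.row_groups)
    # A bounded number of optional extra I/Os, paid only when useful.
    stats = read_segment(f, footer.min_max_stats) if needs_pruning else None
    return schema, row_groups, stats

Whether each unit is then encoded with Thrift or Flatbuffers is the first,
orthogonal concern, and could even be decided per unit.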
Thanks,
Micah

On Mon, Sep 9, 2024 at 2:14 PM Jan Finis <jpfi...@gmail.com> wrote:

> > Another option is to keep the complexity, make inexact the default
> > (and thus not pay bytes for it on the wire) and allow engines to emit
> > exact BINARY if they so desire.
>
> I would argue that - for correctness - inexact *has to be* the default,
> as legacy writers will not write it. So you cannot assume that the
> absence of a recently introduced field means anything specific. If the
> information is missing, you have to be pessimistic, so you have to
> assume the worst.
>
> And you are right that this also solves the space problem, at least in
> Thrift, where it is possible to just leave the field out and save space.
> I don't know whether this is possible in the conceived Flatbuffers
> encoding. If it is possible there as well, then we're good. If it's not,
> then the default won't help us.
>
> > On one hand we handle the worst case in a very fast way but pessimize
> > every other case.
>
> On the other hand, 1 byte per row group is also absolutely negligible,
> unless you have super wide schemas and/or super short / sparse row
> groups. How often does this happen? I guess you will find use cases
> where a very wide schema is common, and you will find use cases where
> needing min/max pushdown is super common. You will find cases where one
> of the two is absolutely useless, while the other is important.
>
> > Intuitively for BINARY to have the same max across a large percentage
> > of the row group population, I think we either have:
>
> Here the difference is indeed whether we mean *inexact* or *truncated*.
> I agree with you: for *truncated*, the chance will be high that a single
> row group is definitely the one with the max, as a truncation also gives
> us a lower bound and thus a limit on how imprecise the bound can be. If
> the bound is merely labeled *inexact*, it is not guaranteed to be merely
> truncated; it could be arbitrarily inexact, so there is nothing you can
> tell about how inexact the bound is.
>
> Note that currently the statistics field is indeed called "*exact*", not
> "*truncated*", so with the current wording you don't have any
> guarantees. The max for a string column could say "zzz" while the real
> max is "aaa". You might ask why a sane writer would do this. It probably
> wouldn't, but you cannot assume a writer to be sane. The spec - in its
> current wording - allows this, so you have to handle it gracefully.
>
> On Mon, Sep 9, 2024 at 10:16 PM Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.invalid> wrote:
>
> > > If your min/max are not exact, you need to scan all row groups in
> > > the worst case.
> >
> > True, except it is a tradeoff. On one hand we handle the worst case in
> > a very fast way but pessimize every other case. Thus the probability
> > of observing the worst case is important. This is why I said:
> >
> > > how often do we need that little bit extra and how much do we pay
> > > for all the cases where we do not need it.
> >
> > Thinking a bit more about this:
> >
> > For fixed-width values (ints, floats, etc.) we can be exact without
> > cost. That leaves BINARY. To be exact for BINARY we have to pay a cost
> > (one byte for exact/inexact plus the full string), or we chop it and
> > be inexact. The question is how often we get close to the worst case
> > and how much it matters.
> >
> > Intuitively, for BINARY to have the same max across a large percentage
> > of the row group population, I think we either have:
> > 1. A low number of values (not many row groups). I guess in that case
> > it is not so bad; we scan it all.
> > 2. Low NDV but lots of values. In this case the column is almost
> > certainly dictionary encoded, so aggregate pushdown can be done on the
> > dictionary without decoding the data pages. That's typically one extra
> > fetch and a single (dictionary) page scan per row group.
> >
> > If my intuition is correct, perhaps the value of exact BINARY min/max
> > is low enough for it not to be necessary. I can see if this can be
> > instrumented and backed by data.
> >
> > Another option is to keep the complexity, make inexact the default
> > (and thus not pay bytes for it on the wire) and allow engines to emit
> > exact BINARY if they so desire.
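To make the worst-case point above concrete, here is a minimal reader-side
sketch of MAX(col) pushdown under pessimistic exactness handling. The
field names are hypothetical (not the actual Thrift/Flatbuffers
definitions under discussion), and it assumes the stored max is at least a
valid upper bound, which, as Jan notes, the current wording does not even
guarantee:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ColumnStats:
    # Hypothetical per-row-group statistics record.
    max_value: bytes          # claimed upper bound of the row group's max
    is_exact: Optional[bool]  # None: a legacy writer never wrote the flag

def row_groups_to_scan_for_max(stats: List[ColumnStats]) -> List[int]:
    """Indices of row groups that must still be scanned for MAX(col)."""
    # The absence of a recently introduced field carries no information:
    # legacy writers never wrote it, so the reader must assume the worst.
    exact = [s.is_exact is True for s in stats]

    # The best max we know is actually attained in some row group.
    attained = max(
        (s.max_value for s, e in zip(stats, exact) if e), default=None
    )

    scan = []
    for i, s in enumerate(stats):
        if exact[i]:
            continue  # its contribution is already covered by `attained`
        # An inexact bound says nothing about *how* inexact it is, so any
        # group whose claimed max exceeds the best attained value might
        # (or might not!) hold the real max and must be scanned. With no
        # exact bounds at all, this degenerates to scanning every group.
        if attained is None or s.max_value > attained:
            scan.append(i)
    return scan

If the flag instead guaranteed *truncated*, the stored prefix would also
be a lower bound on the true max, which is exactly what would let a reader
single out the one row group that definitely holds it.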