> Another option is to keep the complexity, make inexact the default (and
> thus not pay bytes for it on the wire) and allow engines to emit exact
> BINARY if they so desire.
I would argue that, for correctness, inexact *has to be* the default, as legacy writers will not write the field. You cannot assume that the absence of a recently introduced field means anything specific; if the information is missing, you have to be pessimistic and assume the worst.

And you are right that this also solves the space problem, at least in Thrift: there it is possible to just leave the field out and save the space. I don't know whether this is possible in the conceived flatbuffers encoding as well. If it is, then we're good; if it's not, then the default won't help us.

> On one hand we handle the worst case in a
> very fast way but pessimize every other case.

On the other hand, 1 byte per row group is also absolutely negligible, unless you have super wide schemas and/or super short / sparse row groups. How often does this happen? I guess you will find use cases where a very wide schema is common, and you will find use cases where needing min/max pushdown is super common. You will find cases where one of the two is absolutely useless while the other is important.

> Intuitively for BINARY to have the same max across a large percentage of
> the row group population, I think we either have:

Here the difference is indeed whether we mean *inexact* or *truncated*. I agree with you that for *truncated* bounds the chance is high that a single row group is definitely the one with the max, as a truncation also gives us a lower bound on how imprecise the bound can be. If the bound is merely labeled *inexact*, it is not guaranteed to be only truncated; it could be arbitrarily inexact, so there is nothing you can tell about how far off the bound is. Note that the statistics field is currently called "*exact*", not "*truncated*", so with the current wording you don't have any guarantees: the max for a string column could say "zzz" while the real max is "aaa". You might ask why a sane writer would do this.
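To make the reader-side consequence concrete: under the current wording, an engine pushing down MAX(col) can only trust bounds that are flagged exact, and must treat a missing flag the same as an inexact one. A minimal sketch of that logic (the field names here are made up for illustration, not the actual Thrift definitions):

```python
def candidates_for_max(row_groups):
    """Which row groups must still be scanned to answer MAX(col)?

    Each row group is modeled as a dict with a 'max' bound (bytes) and
    a 'max_exact' flag; None models a legacy writer that never wrote
    the flag, which must be treated the same as False (pessimistic).
    """
    # An exact max is an achieved value, so it is a lower bound on the
    # true maximum of the whole file.
    exact_maxes = [rg["max"] for rg in row_groups if rg.get("max_exact")]
    floor = max(exact_maxes, default=None)
    # An inexact bound may be arbitrarily far above the real max
    # ("zzz" vs. "aaa"), so any row group whose bound exceeds the
    # floor could still hold the true maximum and must be scanned.
    return [rg for rg in row_groups
            if not rg.get("max_exact")
            and (floor is None or rg["max"] > floor)]
```

With a legacy row group reporting an unflagged max of b"zzz" next to an exact b"abc", the legacy group still has to be scanned, because its real max could be anywhere at or below "zzz"; if all bounds were exact, the aggregate could be answered from metadata alone.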
It probably wouldn't, but you cannot assume that every writer is sane. The spec, in its current wording, allows this, so you have to handle it gracefully.

On Mon, Sep 9, 2024 at 22:16, Alkis Evlogimenos <[email protected]> wrote:

> > If your min/max are not exact, you need to scan all row groups in the
> > worst case.
>
> True except it is a tradeoff. On one hand we handle the worst case in a
> very fast way but pessimize every other case. Thus the probability of
> observing the worst case is important. This is why I said:
>
> > how often do we need that little bit extra and how much do we pay for
> > all the cases where we do not need it.
>
> Thinking a bit more about this:
>
> For fixed width values (ints, floats, etc) we can be exact without cost.
> That leaves BINARY. To be exact for BINARY we have to pay a cost (one byte
> for exact/inexact plus the full string) or we chop it and be inexact. The
> question is how often we get closer to the worst case and how much does it
> matter?
>
> Intuitively for BINARY to have the same max across a large percentage of
> the row group population, I think we either have:
> 1. low number of values (not many row groups). I guess in that case it is
> not so bad, we scan it all.
> 2. low NDV but lots of values. In this case the column is almost certainly
> dictionary encoded. In this case aggregate pushdown can be done on the
> dictionary without decoding the data pages. That's typically one extra
> fetch and single (dictionary) page scan per row group.
>
> If my intuition is correct perhaps the value of exact BINARY min/max is
> low enough to not be necessary. I can see if this can be instrumented and
> backed by data.
>
> Another option is to keep the complexity, make inexact the default (and
> thus not pay bytes for it on the wire) and allow engines to emit exact
> BINARY if they so desire.
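As a side note on point 2 in the quoted mail, the dictionary-based pushdown can be sketched as follows (hypothetical names; it assumes, as the mail does, that no data page of the row group fell back to plain encoding, and that every dictionary entry is actually referenced, otherwise the result is only an upper bound):

```python
def max_via_dictionary(dictionary_values, all_pages_dictionary_encoded):
    """Compute MAX(col) for one row group from its dictionary page only.

    dictionary_values: the decoded entries of the dictionary page.
    all_pages_dictionary_encoded: False if any data page fell back to
    PLAIN encoding; in that case the dictionary alone cannot answer
    the aggregate and we signal a fallback to scanning the data pages.
    """
    if not all_pages_dictionary_encoded:
        return None  # caller must scan the data pages instead
    # One extra fetch and one dictionary-page scan, no data-page decode.
    return max(dictionary_values)
```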
