Apologies for a little bit of a thread hijack:
> > On one hand we handle the worst case in a very fast way but pessimize
> > every other case.
>
> On the other hand, 1 byte per row group is also absolutely negligible,
> unless you have super wide schemas and/or super short / sparse row groups.
> How often does this happen? I guess you will find use cases where a very
> wide schema is common, and you will find use cases where needing min/max
> pushdown is super common. You will find cases where one of the two is
> absolutely useless, while the other is important.

I think this point is pertinent to the metadata discussion we are having.
If adding N bytes that help certain use cases a lot is a problem (really,
in practice, for this particular use case it should be bits per column),
then I think we should be thinking hard about whether the new metadata
design is flexible enough.

In some ways it seems like there are two orthogonal concerns that need to
be addressed:
1. Efficiency of the serialization format in terms of speed/size.
2. Efficiency of the data layout: not pessimizing common cases, where
everything a query needs should be readable in a single I/O operation,
while still allowing a very small number of optional additional I/Os that
drastically improve some use cases.

In short, I think we might want to further divide the current metadata
into smaller units that are addressable separately, rather than trying to
simply translate/optimize one monolithic Flatbuffers structure (this
raises complexity, but it feels like it is potentially worth it).
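To make the second concern concrete, here is a rough reader-side sketch of
what "smaller, separately addressable units" could mean. Everything in it
is hypothetical (the unit names, the footer layout); the point is only
that the footer would carry references to independently decodable
segments, so the common case stays a small read and optional metadata
costs at most one extra I/O:

from dataclasses import dataclass
from typing import BinaryIO, Optional, Tuple

# Hypothetical layout, purely illustrative -- not a proposal for the
# actual encoding of the Parquet footer.

@dataclass
class SegmentRef:
    offset: int  # byte offset of an independently decodable metadata unit
    length: int  # encoded length of that unit

@dataclass
class Footer:
    schema: SegmentRef         # always needed to plan a query
    row_groups: SegmentRef     # row group locations, always needed
    min_max_stats: SegmentRef  # optional: fetched only by pruning queries

def read_segment(f: BinaryIO, ref: SegmentRef) -> bytes:
    f.seek(ref.offset)
    return f.read(ref.length)

def load_metadata(
    f: BinaryIO, footer: Footer, needs_pruning: bool
) -> Tuple[bytes, bytes, Optional[bytes]]:
    # Common case: two small reads (or one, if the units are laid out
    # adjacently) cover everything a non-pruning query needs, so such
    # queries pay nothing for statistics they will never look at.
    schema = read_segment(f, footer.schema)
    row_groups = read_segment(f, footer.row_groups)
    # A bounded number of optional extra I/Os, paid only when useful.
    stats = read_segment(f, footer.min_max_stats) if needs_pruning else None
    return schema, row_groups, stats

Whether each unit is then encoded with Thrift or Flatbuffers is the first,
orthogonal concern, and could even be decided per unit.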
Thanks,
Micah

On Mon, Sep 9, 2024 at 2:14 PM Jan Finis <jpfi...@gmail.com> wrote:

> > Another option is to keep the complexity, make inexact the default
> > (and thus not pay bytes for it on the wire) and allow engines to emit
> > exact BINARY if they so desire.
>
> I would argue that - for correctness - inexact *has to be* the default,
> as legacy writers will not write it. So you cannot assume that the
> absence of a recently introduced field means anything specific. If the
> information is missing, you have to be pessimistic, so you have to
> assume the worst.
>
> And you are right that this also solves the space problem, at least in
> Thrift, where it is possible to just leave the field out and save space.
> I don't know whether this is possible in the conceived Flatbuffers
> encoding. If it is possible there as well, then we're good. If it's not,
> then the default won't help us.
>
> > On one hand we handle the worst case in a very fast way but pessimize
> > every other case.
>
> On the other hand, 1 byte per row group is also absolutely negligible,
> unless you have super wide schemas and/or super short / sparse row
> groups. How often does this happen? I guess you will find use cases
> where a very wide schema is common, and you will find use cases where
> needing min/max pushdown is super common. You will find cases where one
> of the two is absolutely useless, while the other is important.
>
> > Intuitively for BINARY to have the same max across a large percentage
> > of the row group population, I think we either have:
>
> Here the difference is indeed whether we mean *inexact* or *truncated*.
> I agree with you: for *truncated*, the chance will be high that a single
> row group is definitely the one with the max, as a truncation also gives
> us a lower bound and thus a limit on how imprecise the bound can be. If
> the bound is merely labeled *inexact*, it is not guaranteed to be merely
> truncated; it could be arbitrarily inexact, so there is nothing you can
> tell about how inexact the bound is.
>
> Note that currently the statistics field is indeed called "*exact*", not
> "*truncated*", so with the current wording you don't have any
> guarantees. The max for a string column could say "zzz" while the real
> max is "aaa". You might ask why a sane writer would do this. It probably
> wouldn't, but you cannot assume a writer to be sane. The spec - in its
> current wording - allows this, so you have to handle it gracefully.
>
> On Mon, Sep 9, 2024 at 10:16 PM Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.invalid> wrote:
>
> > > If your min/max are not exact, you need to scan all row groups in
> > > the worst case.
> >
> > True, except it is a tradeoff. On one hand we handle the worst case in
> > a very fast way but pessimize every other case. Thus the probability
> > of observing the worst case is important. This is why I said:
> >
> > > how often do we need that little bit extra and how much do we pay
> > > for all the cases where we do not need it.
> >
> > Thinking a bit more about this:
> >
> > For fixed-width values (ints, floats, etc.) we can be exact without
> > cost. That leaves BINARY. To be exact for BINARY we have to pay a cost
> > (one byte for exact/inexact plus the full string), or we chop it and
> > be inexact. The question is how often we get close to the worst case
> > and how much it matters.
> >
> > Intuitively, for BINARY to have the same max across a large percentage
> > of the row group population, I think we either have:
> > 1. A low number of values (not many row groups). I guess in that case
> > it is not so bad; we scan it all.
> > 2. Low NDV but lots of values. In this case the column is almost
> > certainly dictionary encoded, so aggregate pushdown can be done on the
> > dictionary without decoding the data pages. That's typically one extra
> > fetch and a single (dictionary) page scan per row group.
> >
> > If my intuition is correct, perhaps the value of exact BINARY min/max
> > is low enough for it not to be necessary. I can see if this can be
> > instrumented and backed by data.
> >
> > Another option is to keep the complexity, make inexact the default
> > (and thus not pay bytes for it on the wire) and allow engines to emit
> > exact BINARY if they so desire.
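To make the worst-case point above concrete, here is a minimal reader-side
sketch of MAX(col) pushdown under pessimistic exactness handling. The
field names are hypothetical (not the actual Thrift/Flatbuffers
definitions under discussion), and it assumes the stored max is at least a
valid upper bound, which, as Jan notes, the current wording does not even
guarantee:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ColumnStats:
    # Hypothetical per-row-group statistics record.
    max_value: bytes          # claimed upper bound of the row group's max
    is_exact: Optional[bool]  # None: a legacy writer never wrote the flag

def row_groups_to_scan_for_max(stats: List[ColumnStats]) -> List[int]:
    """Indices of row groups that must still be scanned for MAX(col)."""
    # The absence of a recently introduced field carries no information:
    # legacy writers never wrote it, so the reader must assume the worst.
    exact = [s.is_exact is True for s in stats]

    # The best max we know is actually attained in some row group.
    attained = max(
        (s.max_value for s, e in zip(stats, exact) if e), default=None
    )

    scan = []
    for i, s in enumerate(stats):
        if exact[i]:
            continue  # its contribution is already covered by `attained`
        # An inexact bound says nothing about *how* inexact it is, so any
        # group whose claimed max exceeds the best attained value might
        # (or might not!) hold the real max and must be scanned. With no
        # exact bounds at all, this degenerates to scanning every group.
        if attained is None or s.max_value > attained:
            scan.append(i)
    return scan

If the flag instead guaranteed *truncated*, the stored prefix would also
be a lower bound on the true max, which is exactly what would let a reader
single out the one row group that definitely holds it.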