The optimizations here are not flatbuffer-specific. There are some nuances between flatbuffers and Thrift, but the issue is at a higher level.
The discussion has helped me with the design. I kept the semantics the same as in the current metadata, but made the case where strings are inexact much more compact.

On Wed, Sep 11, 2024 at 9:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Apologies for a little bit of a thread hi-jack:
>
> > > On one hand we handle the worst case in a very fast way but pessimize every other case.
> >
> > On the other hand, 1 byte per row group is also absolutely negligible, unless you have super wide schemas and/or super short / sparse row groups. How often does this happen? I guess you will find use cases where a very wide schema is common and you will find use cases where needing min/max pushdown is super common. You will find cases where one of the two is absolutely useless, while the other is important.
>
> I think this point is pertinent to the metadata discussion we are having. If adding N bytes that help certain use-cases a lot is a problem (really, in practice, if we are concerned about this particular use-case it should be bits per column), then I think we should be thinking hard about whether the new metadata design is flexible enough. In some ways it seems like there are two orthogonal concerns that need to be addressed:
>
> 1. Efficiency of the serialization format in terms of speed/size.
> 2. Efficiency of the data layout, so that common cases are not pessimized by I/O concerns: reading everything that is needed in a single operation vs. allowing a very small number of additional I/Os to drastically improve some use-cases.
>
> In short, I think we might want to further divide the current metadata into smaller units that are addressable separately, rather than simply translating/optimizing one monolithic Flatbuffer structure (this raises complexity, but it feels like it is potentially worth it).
>
> Thanks,
> Micah
>
> On Mon, Sep 9, 2024 at 2:14 PM Jan Finis <jpfi...@gmail.com> wrote:
>
> > > Another option is to keep the complexity, make inexact the default (and thus not pay bytes for it on the wire) and allow engines to emit exact BINARY if they so desire.
> >
> > I would argue that - for correctness - inexact *has to be* the default, as legacy writers will not write it. So you cannot assume that the absence of a recently introduced field means anything specific. If the information is missing, you have to be pessimistic, so you have to assume the worst. And you are right that this also solves the space problem, at least in Thrift. In Thrift, it is possible to just leave the field out and save space. I don't know whether this is possible in the proposed flatbuffers encoding. If it is possible there as well, then we're good. If it's not, then the default won't help us.
> >
> > > On one hand we handle the worst case in a very fast way but pessimize every other case.
> >
> > On the other hand, 1 byte per row group is also absolutely negligible, unless you have super wide schemas and/or super short / sparse row groups. How often does this happen? I guess you will find use cases where a very wide schema is common and you will find use cases where needing min/max pushdown is super common. You will find cases where one of the two is absolutely useless, while the other is important.
> >
> > > Intuitively for BINARY to have the same max across a large percentage of the row group population, I think we either have:
> >
> > Here the difference is indeed whether we mean *inexact* or *truncated*. I agree with you: for *truncated*, the chance will be high that a single row group is definitely the one with the max, as a truncation also gives us a lower bound on how imprecise the bound can be. If the bound is merely labeled *inexact*, it is not guaranteed that it is only truncated; it could also be just inexact, so there is nothing you can tell about how inexact the bound is.
> >
> > Note that currently the statistics field is indeed called "*exact*", not "*truncated*", so with the current wording, you don't have any guarantees. The max for a string column could say "zzz" but the real max could be "aaa". You might ask why a sane writer would do this. It probably wouldn't, but you cannot assume a writer to be sane. The spec - in its current wording - allows this, so you have to handle it gracefully.
> >
> > On Mon, Sep 9, 2024 at 10:16 PM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > > If your min/max are not exact, you need to scan all row groups in the worst case.
> > >
> > > True, except it is a tradeoff. On one hand we handle the worst case in a very fast way but pessimize every other case. Thus the probability of observing the worst case is important. This is why I said:
> > >
> > > > how often do we need that little bit extra and how much do we pay for all the cases where we do not need it.
> > >
> > > Thinking a bit more about this:
> > >
> > > For fixed-width values (ints, floats, etc.) we can be exact without cost. That leaves BINARY. To be exact for BINARY we have to pay a cost (one byte for exact/inexact plus the full string), or we chop it and are inexact. The question is how often we get close to the worst case and how much it matters.
> > >
> > > Intuitively, for BINARY to have the same max across a large percentage of the row group population, I think we either have:
> > > 1. A low number of values (not many row groups). I guess in that case it is not so bad; we scan it all.
> > > 2. Low NDV but lots of values. In this case the column is almost certainly dictionary encoded, and aggregate pushdown can be done on the dictionary without decoding the data pages. That's typically one extra fetch and a single (dictionary) page scan per row group.
> > >
> > > If my intuition is correct, perhaps the value of exact BINARY min/max is low enough that it is not necessary. I can see if this can be instrumented and backed by data.
> > >
> > > Another option is to keep the complexity, make inexact the default (and thus not pay bytes for it on the wire) and allow engines to emit exact BINARY if they so desire.
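
To make the pushdown tradeoff discussed above concrete, here is a minimal Python sketch of how a reader could pick the row groups that must be scanned to answer MAX(col) when maxima may be inexact. It assumes that an inexact max is still a valid (possibly very loose) upper bound, and it treats a missing exactness flag as inexact, as argued above. The names RowGroupStats, max_value and is_max_value_exact are illustrative stand-ins, not a reference to the actual Thrift or Flatbuffers schema.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class RowGroupStats:
        max_value: Optional[bytes]            # upper bound for the column in this row group
        is_max_value_exact: Optional[bool]    # None: legacy writer, flag absent -> assume inexact

    def row_groups_to_scan_for_max(groups: List[RowGroupStats]) -> List[int]:
        # Best bound that is known to be attained by an actual value in some row group.
        exact_maxes = [g.max_value for g in groups
                       if g.is_max_value_exact is True and g.max_value is not None]
        best_exact = max(exact_maxes) if exact_maxes else None

        to_scan = []
        for i, g in enumerate(groups):
            if g.max_value is None:
                to_scan.append(i)   # no statistics at all: must scan
            elif best_exact is None or g.max_value > best_exact:
                to_scan.append(i)   # no exact bound anywhere, or this bound could exceed it
            # else: true max of group i <= its bound <= best_exact, so it cannot change MAX
        return to_scan

    stats = [
        RowGroupStats(max_value=b"pear", is_max_value_exact=True),
        RowGroupStats(max_value=b"zzz", is_max_value_exact=False),   # loose bound
        RowGroupStats(max_value=b"fig", is_max_value_exact=None),    # legacy writer
    ]
    print(row_groups_to_scan_for_max(stats))   # [1]

The final answer is the maximum of best_exact and the values found in the scanned groups. If no bound is exact, best_exact stays None and every row group is scanned, which is exactly the worst case described in the thread.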
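The dictionary shortcut from point 2 can be sketched in the same spirit. The sketch below assumes every data page of the column chunk is dictionary-encoded (no fallback to plain encoding) and that every dictionary entry actually occurs in the data; under those assumptions MAX(col) can be computed from the decoded dictionary pages alone. max_via_dictionaries is a hypothetical helper, not an existing Parquet API, and the one extra fetch and decode per row group mentioned above is assumed to have happened already.

    from typing import Iterable, List, Optional

    def max_via_dictionaries(dictionaries: Iterable[List[bytes]]) -> Optional[bytes]:
        # Each element is the decoded dictionary page of one row group's column chunk.
        best: Optional[bytes] = None
        for dictionary in dictionaries:
            if dictionary:                      # skip row groups with an empty/absent dictionary
                candidate = max(dictionary)     # max over dictionary entries; data pages untouched
                if best is None or candidate > best:
                    best = candidate
        return best

    # e.g. three row groups, one decoded dictionary page each:
    print(max_via_dictionaries([[b"apple", b"pear"], [b"fig"], [b"pear", b"plum"]]))   # b'plum'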