adriangb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3389845185
One use case for `Distribution` I wanted to explore that is compatible with Parquet is what I'll call a "footer table sample". I don't remember where I first heard of this or what I should call it, but I did discuss it with Hannes of DuckDB and it sounds like a really cool idea.

TLDR: it's expensive to randomly sample compressed columnar storage like Parquet, but if you store a pre-sampled portion of the file, e.g. as the last row group, you can get very good estimates for all kinds of things (filter selectivity, cardinality of any column, etc.), and it's very efficient IO-wise to get that data (it's all nicely packed into 1 read unit). My thought is that something like this could be used to easily get *estimated* distributions and cardinality from the data.

> I also feel that there's a slight conflict of interest or at least two camps here:
>
> * statistics always-correct optimizers: Some people use statistics for optimizers like join ordering. There a wrong statistics often only results in slower execution, but never wrong results. That is kinda reflected in a lot of statistics calculation in the DF code base.
> * correctness: Some plan transformers (InfluxData for example has one) rely on the statistics that actually can make hard promises, i.e. "all values are FOR SURE in this range". In that case, you really wanna be picky about what the stats do.

I agree with this. My biggest issue with the current statistics is that we only have `Exact` and `Inexact`, but `Inexact` isn't really what you want for the second case you list; you want something like `Bounded`. I also think the current statistics are lacking info like the size of each column, which is much more useful than the total file size in almost every use case (most queries are not `select *`).
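To make the `Bounded` idea concrete, here is a minimal sketch of what such a variant could look like. Everything in it is hypothetical: this is not DataFusion's actual statistics type, and the `Bounded` fields, the `Absent` variant, and the method names `guaranteed_range` and `definitely_less_than` are assumptions made up for illustration. The point is only that a correctness-sensitive transformation should be able to ask for a hard guarantee and get nothing back from an `Inexact` value.

```rust
/// Hypothetical precision wrapper, NOT DataFusion's real type.
/// `Bounded` and the helper methods below are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    /// The value is known exactly.
    Exact(T),
    /// The true value is guaranteed to lie within `lower..=upper`.
    Bounded { lower: T, upper: T },
    /// A best-effort estimate with no guarantee attached.
    Inexact(T),
    /// Nothing is known.
    Absent,
}

impl<T: PartialOrd + Clone> Precision<T> {
    /// Returns a range that is guaranteed to contain the true value,
    /// or `None` when only an estimate (or nothing) is available.
    fn guaranteed_range(&self) -> Option<(T, T)> {
        match self {
            Precision::Exact(v) => Some((v.clone(), v.clone())),
            Precision::Bounded { lower, upper } => Some((lower.clone(), upper.clone())),
            Precision::Inexact(_) | Precision::Absent => None,
        }
    }

    /// True only if the value is provably below `limit`.
    /// Estimates never qualify, so a caller that needs hard promises
    /// cannot be misled by an `Inexact` statistic.
    fn definitely_less_than(&self, limit: &T) -> bool {
        match self.guaranteed_range() {
            Some((_, upper)) => upper < *limit,
            None => false,
        }
    }
}
```

With this shape, an optimizer that only cares about cost can still read the value out of `Inexact`, while a plan transformer in the "correctness" camp would call something like `definitely_less_than` and fall back to the safe plan whenever it returns `false`.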
