asolimando commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4197819270
> @asolimando Thanks for the detailed and thoughtful reply; it was really helpful and gave me a lot to think about.
>
> > If a better cardinality estimation (closer to real numbers we can observe at runtime) translates into worse plans, I think it's the cost model that should be refined. For this reason I find #20292 very compelling for statistics, as it compares estimates vs real quantities, without the noise that the cost model could be adding.
>
> That makes sense to me. In particular, I agree with the distinction that cardinality estimation is more portable across systems, while the cost model is much more tied to a specific implementation.
>
> I've gone through some of those materials. This might sound a bit pessimistic, but I feel that cardinality estimation is fundamentally hard to get right, and making it mostly correct requires significant engineering effort (which may be challenging for DataFusion at the moment).
>
> My concern is mostly about how we get started in practice. Even if the long-term goal is to improve CE on its own merits, it seems useful to begin with something simple. That may require some co-design to avoid bad plans, which can still be simpler than an accurate cost model.
>
> For example, in this PR's aggregation estimation, there are at least two simple directions for estimating the output of `SELECT * FROM t GROUP BY a, b`:
>
> * Assume independence between a and b, i.e. `NDV(a) * NDV(b)`, which may overestimate.
> * Assume strong dependency, i.e. `max(NDV(a), NDV(b))`.
>
> Some systems might always choose one direction and use external mechanisms to fix bad plans. We might also want to make consistent assumptions to start with, and see how to evolve over time.
>
> Using a single reference system or just simple heuristics are both fine, I think; I just want to ensure we're not mixing ideas from multiple systems too early.

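
To make the two directions quoted above concrete, here is a minimal sketch in Rust. The function names are hypothetical, and capping both estimates by the input row count is my own assumption for illustration; this is not a DataFusion API and not necessarily what this PR implements.

```rust
/// Estimate the number of groups assuming the grouping columns are
/// independent: the product of the per-column NDVs.
/// Assumption (not from the PR): the result is capped by the input row
/// count, since an aggregation cannot emit more groups than input rows.
fn groups_assuming_independence(ndvs: &[u64], input_rows: u64) -> u64 {
    let product = ndvs
        .iter()
        // saturating_mul avoids overflow when NDVs are large
        .fold(1u64, |acc, &ndv| acc.saturating_mul(ndv));
    product.min(input_rows)
}

/// Estimate the number of groups assuming strong dependency between the
/// grouping columns: the largest per-column NDV.
/// Assumption (not from the PR): same cap by input row count as above.
fn groups_assuming_dependency(ndvs: &[u64], input_rows: u64) -> u64 {
    ndvs.iter().copied().max().unwrap_or(1).min(input_rows)
}
```

With `NDV(a) = 100`, `NDV(b) = 50`, and one million input rows, the first function returns 5000 and the second returns 100. The true group count lies somewhere between the two, which is why picking one assumption consistently matters more than the exact choice.
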
Thanks a lot for sharing your POV @2010YOUY01. I find it quite pragmatic, and I agree with keeping things simple at the beginning and refining later, once we have better ways to evaluate the practical impact of choices around CE. Hopefully, future work on making statistics propagation overridable will also relieve some pressure, as downstream systems will always be free to change it as they see fit.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
