asolimando commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4183702274
TLDR: I agree picking a reference system is reasonable. However, research ([Leis et al. 2015](http://www.vldb.org/pvldb/vol9/p204-leis.pdf), [Leis et al. 2018](https://doi.org/10.1007/s00778-017-0480-7), [Leis et al. 2025](http://www.vldb.org/pvldb/vol18/p5531-viktor.pdf)) shows CBO quality is dominated by cardinality estimation, not the cost model. Cardinality is an "absolute" quantity (the same plan produces the same row counts in any system), so it's portable across systems and worth improving beyond any single reference. The cost model and CBO, by contrast, are tied to the specific internals of a given database, and therefore less portable. As a framework, DataFusion can't assume one cost model fits all downstream use cases, so we should invest in sensible defaults for cardinality estimation while making the full statistics and cost model stack pluggable (https://github.com/apache/datafusion/issues/21120 and similar issues to come).

Apologies for the wall of text @2010YOUY01, but you raise a very deep and interesting point below, worth a detailed reply:

> ### Regarding reference system approach
> If we follow a reference-system approach, I suggest going a bit further: identify the best-performing system and try to port its related components more holistically. It's possible that one system works significantly better than another, and that its cardinality estimation is co-designed with other components like the cost model. Taking ideas from multiple reference systems may not yield good results.
>
> The metrics for deciding on a reference system IMO are:
>
> 1. Simple to implement
> 2. With comprehensive documentation, explainable
> 3. Overall improvement on benchmarks like tpcds/join-order-benchmark
>
> DuckDB seems to perform well — they have a thesis (https://blobs.duckdb.org/papers/tom-ebergen-msc-thesis-join-order-optimization-with-almost-no-statistics.pdf) describing their approach, and they've reported benchmark results somewhere.
> Perhaps we can compare it with other reference systems like Spark/Trino similarly, and then pick one reference system and stick to it.

I agree it's reasonable to take a single system as a reference, and that its cost model may be co-designed with its cardinality estimation to some extent. That said, I'd note a few things:

### join-order-benchmark (JOB)

You mentioned the JOB, which Leis et al. proposed in ["How Good Are Query Optimizers, Really?" (PVLDB 2015)](http://www.vldb.org/pvldb/vol9/p204-leis.pdf) precisely to go beyond TPC-DS/TPC-H and other common DB benchmarks, which don't stress join-ordering algorithms enough for realistic cases. I am a big fan of JOB, and I wish DB systems would incorporate it as systematically as the TPC-* benchmarks. That said, JOB has limitations too: it doesn't feature group-by queries (https://github.com/gregrahn/join-order-benchmark), so it doesn't cover all interesting aspects of statistics/CBO, even if join ordering is a prominent problem.

Another important conclusion from those papers is that the quality of CBO is largely dominated by the quality of cardinality estimation, much less by the cost model. So getting cardinality right matters more than the specific cost formula consuming it.

Cardinality estimation is a proxy for a "real" and measurable quantity, the cardinality you will observe at runtime, and it's therefore an "absolute" quantity that doesn't depend on the system: if the exact same plan were executed by any two databases, each operator would see exactly the same number of rows. The cost model, however, is closer to the individual implementation: it is used to rank plans and select the "cheapest", so evaluating the cost model and CBO is best done by judging the effect of plan selection on existing benchmarks (although at the moment we don't have many CBO decisions, so you won't see runtime effects on those benchmarks for now).
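As a side note, the Leis et al. papers quantify "getting cardinality right" with the q-error: the factor by which an estimate deviates from the observed cardinality, symmetric in over- and under-estimation. A minimal sketch (the `q_error` helper here is hypothetical, for illustration only, not a DataFusion API):

```rust
/// q-error: the factor by which an estimate deviates from the observed
/// cardinality, symmetric in over- and under-estimation (Leis et al. 2015).
/// Hypothetical helper for illustration, not a DataFusion API.
fn q_error(estimated: f64, actual: f64) -> f64 {
    // Clamp to >= 1 row to avoid division by zero on empty results.
    let (e, a) = (estimated.max(1.0), actual.max(1.0));
    (e / a).max(a / e)
}

fn main() {
    // Underestimating 10_000 rows as 1_000 gives the same q-error (10.0)
    // as overestimating 1_000 rows as 10_000.
    assert_eq!(q_error(1_000.0, 10_000.0), 10.0);
    assert_eq!(q_error(10_000.0, 1_000.0), 10.0);
    // A perfect estimate has q-error 1.0, the minimum possible value.
    assert_eq!(q_error(500.0, 500.0), 1.0);
    println!("ok");
}
```

Because it is computed purely from estimated vs observed row counts, a metric like this can be reported per operator on any benchmark, independently of whatever cost model consumes the estimates.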
This supports my feeling that we should strive to improve over the reference database when it makes sense. If a better cardinality estimation (closer to the real numbers we can observe at runtime) translates into worse plans, I think it's the cost model that should be refined. For this reason I find https://github.com/apache/datafusion/pull/20292 very compelling for statistics, as it compares estimates against real quantities, without the noise that the cost model could add.

### On DuckDB specifically

The thesis you linked is an interesting read, but it covers only join ordering, so it won't help us much with statistics. It will come in handy for CBO, though (and it's an example of CBO without many detailed statistics, which is a relevant use case for some lakehouse scenarios).

### DataFusion is a framework

Unlike a single database like DuckDB, Trino, or Spark, DataFusion serves diverse downstream use cases where the effectiveness of statistics and CBO can vary significantly. A single DB can make more radical decisions; DataFusion should provide a sensible default but invest heavily in overridability and customization. That's why, alongside improving the default implementation (like this PR), I'm proposing pluggable statistics machinery (like https://github.com/apache/datafusion/issues/21120 for expressions, with more to come for operator-level stats and CBO), to allow downstream systems to customize statistics as they see fit.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
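To make the pluggability point above concrete, here is a hypothetical sketch of what an overridable estimator could look like. The trait and all names are invented for illustration; this is not the API proposed in https://github.com/apache/datafusion/issues/21120:

```rust
/// Hypothetical sketch of a pluggable cardinality estimator.
/// All names are invented for illustration; not a DataFusion API.
trait CardinalityEstimator {
    /// Estimated output rows of a filter, given input rows and a
    /// selectivity guess in [0, 1].
    fn filter_rows(&self, input_rows: u64, selectivity: f64) -> u64;
}

/// A sensible default: scale input rows by the selectivity
/// (an independence-style assumption). Downstream systems could
/// swap in an implementation backed by their own statistics.
struct DefaultEstimator;

impl CardinalityEstimator for DefaultEstimator {
    fn filter_rows(&self, input_rows: u64, selectivity: f64) -> u64 {
        (input_rows as f64 * selectivity.clamp(0.0, 1.0)).round() as u64
    }
}

fn main() {
    let est = DefaultEstimator;
    // 1_000_000 input rows at 10% selectivity -> ~100_000 estimated rows.
    assert_eq!(est.filter_rows(1_000_000, 0.1), 100_000);
    // Selectivity 1.0 passes everything through unchanged.
    assert_eq!(est.filter_rows(10, 1.0), 10);
    println!("ok");
}
```

The point is only the shape: the default stays simple and documented, while the trait boundary lets a downstream system replace the estimate without touching the planner or cost model.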
