asolimando commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4183702274
TLDR: I agree picking a reference system is reasonable. However, research ([Leis et al. 2015](http://www.vldb.org/pvldb/vol9/p204-leis.pdf), [Leis et al. 2018](https://doi.org/10.1007/s00778-017-0480-7), [Leis et al. 2025](http://www.vldb.org/pvldb/vol18/p5531-viktor.pdf)) shows CBO quality is dominated by cardinality estimation, not the cost model. Cardinality is an "absolute" quantity (the same plan produces the same row counts in any system), so it's portable across systems and worth improving beyond any single reference. The cost model and CBO, by contrast, are tied to the specific internals of a given database, and therefore less portable. As a framework, DataFusion can't assume one cost model fits all downstream use cases, so we should invest in sensible defaults for cardinality estimation while making the full statistics and cost model stack pluggable (https://github.com/apache/datafusion/issues/21120 and similar issues to come).

Apologies for the wall of text @2010YOUY01, but you raise a very deep and interesting point below, worth a detailed reply:

> ### Regarding reference system approach
> If we follow a reference-system approach, I suggest going a bit further: identify the best-performing system and try to port its related components more holistically. It's possible that one system works significantly better than another, and that its cardinality estimation is co-designed with other components like the cost model. Taking ideas from multiple reference systems may not yield good results.
>
> The metrics for deciding on a reference system IMO are:
>
> 1. Simple to implement
> 2. With comprehensive documentation, explainable
> 3. Overall improvement on benchmarks like tpcds/join-order-benchmark
>
> DuckDB seems to perform well — they have a thesis (https://blobs.duckdb.org/papers/tom-ebergen-msc-thesis-join-order-optimization-with-almost-no-statistics.pdf) describing their approach, and they've reported benchmark results somewhere.
> Perhaps we can compare it with other reference systems like Spark/Trino similarly, and then pick one reference system and stick to it.

I agree it's reasonable to take a single system as a reference, and that its cost model may be co-designed with its cardinality estimation to some extent. That said, I'd note a few things:

### join-order-benchmark (JOB)

You mentioned the JOB, which Leis et al. proposed in ["How Good Are Query Optimizers, Really?" (PVLDB 2015)](http://www.vldb.org/pvldb/vol9/p204-leis.pdf) precisely to go beyond TPC-DS/TPC-H and other common DB benchmarks, which don't stress join-ordering algorithms enough for realistic cases. I am a big fan of JOB, and I wish DB systems would incorporate it as systematically as the TPC-* benchmarks. That said, JOB has limitations too: it doesn't feature group-by queries (https://github.com/gregrahn/join-order-benchmark), so it doesn't cover all interesting aspects of statistics/CBO, even if join ordering is a prominent problem.

Another important conclusion from those papers is that the quality of CBO is largely dominated by the quality of cardinality estimation, much less by the cost model. So getting cardinality right matters more than the specific cost formula consuming it.

Cardinality estimation is a proxy for a "real" and measurable quantity, the cardinality you will observe at runtime, and it's therefore an "absolute" quantity that doesn't depend on the system: if the exact same plan were executed by any two databases, each operator would see exactly the same number of rows. The cost model, however, is closer to the individual implementation: it is used to rank plans and select the "cheapest", so evaluating the cost model and CBO is best done by judging the effect of plan selection on existing benchmarks (although at the moment we don't have many CBO decisions, so you won't see runtime effects on those benchmarks for now).
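As a side note, the Leis et al. papers quantify "getting cardinality right" with the q-error: the factor by which an estimate deviates from the observed cardinality, symmetric in over- and under-estimation. A minimal sketch (the `q_error` helper here is hypothetical, for illustration only, not a DataFusion API):

```rust
/// q-error: the factor by which an estimate deviates from the observed
/// cardinality, symmetric in over- and under-estimation (Leis et al. 2015).
/// Hypothetical helper for illustration, not a DataFusion API.
fn q_error(estimated: f64, actual: f64) -> f64 {
    // Clamp to >= 1 row to avoid division by zero on empty results.
    let (e, a) = (estimated.max(1.0), actual.max(1.0));
    (e / a).max(a / e)
}

fn main() {
    // Underestimating 10_000 rows as 1_000 gives the same q-error (10.0)
    // as overestimating 1_000 rows as 10_000.
    assert_eq!(q_error(1_000.0, 10_000.0), 10.0);
    assert_eq!(q_error(10_000.0, 1_000.0), 10.0);
    // A perfect estimate has q-error 1.0, the minimum possible value.
    assert_eq!(q_error(500.0, 500.0), 1.0);
    println!("ok");
}
```

Because it is computed purely from estimated vs observed row counts, a metric like this can be reported per operator on any benchmark, independently of whatever cost model consumes the estimates.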
This supports my feeling that we should strive to improve over the reference database when it makes sense. If a better cardinality estimation (closer to the real numbers we can observe at runtime) translates into worse plans, I think it's the cost model that should be refined. For this reason I find https://github.com/apache/datafusion/pull/20292 very compelling for statistics, as it compares estimates against real quantities, without the noise that the cost model could add.

### On DuckDB specifically

The thesis you linked is an interesting read, but it covers only join ordering, so it won't help us much with statistics. It will come in handy for CBO, though (and it's an example of CBO without many detailed statistics, which is a relevant use case for some lakehouse scenarios).

### DataFusion is a framework

Unlike a single database like DuckDB, Trino, or Spark, DataFusion serves diverse downstream use cases where the effectiveness of statistics and CBO can vary significantly. A single DB can make more radical decisions; DataFusion should provide a sensible default but invest heavily in overridability and customization. That's why, alongside improving the default implementation (like this PR), I'm proposing pluggable statistics machinery (like https://github.com/apache/datafusion/issues/21120 for expressions, with more to come for operator-level stats and CBO), to allow downstream systems to customize statistics as they see fit.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
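To make the pluggability point above concrete, here is a hypothetical sketch of what an overridable estimator could look like. The trait and all names are invented for illustration; this is not the API proposed in https://github.com/apache/datafusion/issues/21120:

```rust
/// Hypothetical sketch of a pluggable cardinality estimator.
/// All names are invented for illustration; not a DataFusion API.
trait CardinalityEstimator {
    /// Estimated output rows of a filter, given input rows and a
    /// selectivity guess in [0, 1].
    fn filter_rows(&self, input_rows: u64, selectivity: f64) -> u64;
}

/// A sensible default: scale input rows by the selectivity
/// (an independence-style assumption). Downstream systems could
/// swap in an implementation backed by their own statistics.
struct DefaultEstimator;

impl CardinalityEstimator for DefaultEstimator {
    fn filter_rows(&self, input_rows: u64, selectivity: f64) -> u64 {
        (input_rows as f64 * selectivity.clamp(0.0, 1.0)).round() as u64
    }
}

fn main() {
    let est = DefaultEstimator;
    // 1_000_000 input rows at 10% selectivity -> ~100_000 estimated rows.
    assert_eq!(est.filter_rows(1_000_000, 0.1), 100_000);
    // Selectivity 1.0 passes everything through unchanged.
    assert_eq!(est.filter_rows(10, 1.0), 10);
    println!("ok");
}
```

The point is only the shape: the default stays simple and documented, while the trait boundary lets a downstream system replace the estimate without touching the planner or cost model.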
