2010YOUY01 commented on PR #20926: URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4173976995
> @2010YOUY01, apologies for the direct ping, would you be interested in taking a look?
>
> Re. the discussion in [#21120 (comment)](https://github.com/apache/datafusion/issues/21120#issuecomment-4131266771), we have taken Trino and Spark as reference; for grouping sets we went a bit further than those systems, but I believe enough details are provided for the proposed formula (see [#20926 (comment)](https://github.com/apache/datafusion/pull/20926#discussion_r2942442151) for the discussion, which @buraksenn captured in code comments).
>
> Is this matching your suggestion for the CBO improvements you shared in [#21120 (comment)](https://github.com/apache/datafusion/issues/21120#issuecomment-4131266771)?
>
> This is an interesting example as it covers the "porting" of existing statistics-propagation code from battle-tested systems, plus a reasonable extension, so I am particularly interested in your feedback to calibrate future PRs and reviews.

Thanks for sharing the context! Happy to take a look. I don't have prior experience with CBO, and I'm still working through the relevant material. A few questions:

- Why do we assume independence between grouping keys? Is this just a simplifying heuristic, or do the reference systems do it intentionally because overestimating NDV leads to better end-to-end performance (e.g., for join reordering)?
- Why do we explicitly account for nulls? This adds implementation complexity but seems to have only a minor impact.

### Regarding the reference-system approach

If we follow a reference-system approach, I suggest going a bit further: identify the best-performing system and port its related components more holistically. It's possible that one system works significantly better than another, and that its cardinality estimation is co-designed with other components like the cost model; taking ideas from multiple reference systems may not yield good results.

IMO the criteria for choosing a reference system are:

1. Simple to implement
2. Comprehensively documented and explainable
3. Overall improvement on benchmarks like TPC-DS and the Join Order Benchmark

DuckDB seems to perform well: they have a thesis (https://blobs.duckdb.org/papers/tom-ebergen-msc-thesis-join-order-optimization-with-almost-no-statistics.pdf) describing their approach, and they've reported benchmark results. Perhaps we can compare it with other reference systems like Spark/Trino similarly, and then pick one reference system and stick to it.
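For concreteness, the kind of estimate being questioned above (independence between grouping keys, NULL counted as an extra distinct value, capped by input row count) can be sketched roughly as follows. This is a hypothetical illustration in the style of the Spark/Trino formulas discussed in the PR, not DataFusion's actual code; the function name and parameters are made up for this example.

```python
# Hypothetical sketch of an independence-based GROUP BY cardinality
# estimate -- illustrative only, not DataFusion's implementation.

def estimate_group_by_ndv(column_ndvs, column_has_nulls, input_rows):
    """Estimate the number of distinct groups, assuming the grouping
    keys are statistically independent.

    column_ndvs:      per-key distinct-value counts
    column_has_nulls: whether each key column contains NULLs
    input_rows:       input row count (an upper bound on the output)
    """
    estimate = 1
    for ndv, has_nulls in zip(column_ndvs, column_has_nulls):
        # NULL forms its own group in GROUP BY, so treat it as
        # one extra distinct value for that key.
        estimate *= ndv + (1 if has_nulls else 0)
    # Correlated keys make the product overshoot; the number of
    # output groups can never exceed the number of input rows.
    return min(estimate, input_rows)

print(estimate_group_by_ndv([3, 4], [False, False], 1000))  # 3 * 4 = 12
print(estimate_group_by_ndv([10, 20], [False, True], 150))  # min(10 * 21, 150) = 150
```

The second call shows both questions at once: the `+1` for nulls changes the product only marginally, while the row-count cap is what keeps the independence assumption's overestimate bounded.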
