Re: [PR] Estimate aggregate output rows using existing NDV statistics [datafusion]

via GitHub Tue, 07 Apr 2026 02:07:28 -0700


asolimando commented on PR #20926:
URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4197819270


   > @asolimando Thanks for the detailed and thoughtful reply. It was really 
helpful and gave me a lot to think about.
   > 
   > > If a better cardinality estimation (closer to real numbers we can 
observe at runtime) translates into worse plans, I think it's the cost model 
that should be refined. For this reason I find #20292 very compelling for 
statistics, as it compares estimates vs real quantities, without the noise that 
the cost model could be adding.
   > 
   > That makes sense to me. In particular, I agree with the distinction that 
cardinality estimation is more portable across systems, while the cost model is 
much more tied to a specific implementation.
   > 
   > I’ve gone through some of those materials. This might sound a bit 
pessimistic, but I feel that cardinality estimation is fundamentally hard to 
get right, and making it mostly correct requires significant engineering effort 
(which may be challenging for DataFusion at the moment).
   > 
   > My concern is mostly about how we get started in practice. Even if the 
long-term goal is to improve CE on its own merits, it seems useful to begin 
with something simple. That may require some co-design to avoid bad plans, 
which can still be simpler than an accurate cost model.
   > 
   > For example, in this PR’s aggregation estimation, there are at least two 
simple directions for estimating the output of `SELECT * FROM t GROUP BY a, b`:
   > 
   > * Assume independence between a and b, i.e. `NDV(a) * NDV(b)`, which may 
overestimate.
   > * Assume strong dependency, i.e. `max(NDV(a), NDV(b))`.
   > 
   > Some systems might always choose one direction and use external 
mechanisms to fix bad plans. We might also want to start with consistent 
assumptions and see how to evolve them over time.
   > 
   > Using a single reference system or just simple heuristics are both fine, 
I think. I just want to ensure we're not mixing ideas from multiple systems too 
early.
   
   Thanks a lot for sharing your POV @2010YOUY01, I find it quite pragmatic and 
I agree with keeping it simple at the beginning and refining later, once we 
have better ways to evaluate the practical impact of choices around CE.
   
   Hopefully future work on making statistics propagation overridable will also 
relieve some pressure, as downstream systems will always be free to change it 
as they see fit.
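   
   As a rough illustration of the two directions quoted above for 
`GROUP BY a, b`, here is a minimal sketch. The function names are hypothetical 
and this is not the code in this PR or any DataFusion API. Capping both 
estimates at the input row count is an assumption added here, on the grounds 
that an aggregation cannot produce more groups than it has input rows.

```rust
/// Hypothetical helper (not a DataFusion API): assumes the grouping columns
/// are independent, so the estimate is the product of their NDVs.
/// The result is capped at `input_rows` (an assumption of this sketch).
fn estimate_groups_independent(ndvs: &[u64], input_rows: u64) -> u64 {
    // Saturating multiplication avoids overflow when NDVs are large.
    let product = ndvs
        .iter()
        .fold(1u64, |acc, &ndv| acc.saturating_mul(ndv.max(1)));
    product.min(input_rows)
}

/// Hypothetical helper (not a DataFusion API): assumes strong dependency
/// between the grouping columns, so the estimate is the largest single NDV.
/// The result is capped at `input_rows` (an assumption of this sketch).
fn estimate_groups_dependent(ndvs: &[u64], input_rows: u64) -> u64 {
    let max_ndv = ndvs.iter().copied().max().unwrap_or(1);
    max_ndv.max(1).min(input_rows)
}
```

   With `NDV(a) = 100`, `NDV(b) = 50` and 1,000,000 input rows, the first 
function returns 5000 and the second returns 100. The true group count lies 
somewhere between the two, which is why picking one assumption consistently 
matters more at this stage than picking the "right" one.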


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


