2010YOUY01 commented on PR #20926:
URL: https://github.com/apache/datafusion/pull/20926#issuecomment-4188044669

   @asolimando Thanks for the detailed and thoughtful reply — really helpful 
and gave me a lot to think about.
   
   > If a better cardinality estimation (closer to real numbers we can observe 
at runtime) translates in worse plans, I think it's the cost-model that should 
be refined. For this reason I find #20292 very compelling for statistics, as it 
compares estimates vs real quantities, without the noise that the cost model 
could be adding.
   
   I’ve gone through some of those materials. This might sound a bit 
pessimistic, but I feel that cardinality estimation is fundamentally hard to 
get right, and making it mostly correct requires significant engineering effort 
(which may be challenging for DataFusion at the moment).
   
   Given that, I think it makes sense to start with simple approaches and look 
for mechanisms outside of cardinality estimation (CE) to avoid disastrous plans.
   
   For example, in this PR’s aggregation estimation, there are two 
straightforward heuristics:
   
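   For estimating the output of `SELECT a, b FROM t GROUP BY a, b`:
   1. Assume independence between `a` and `b`, i.e., `NDV(a) * NDV(b)`. This 
is an upper bound on the number of groups, so it may overestimate.
   2. Assume full dependency, i.e., `max(NDV(a), NDV(b))`. This is a lower 
bound on the number of groups, so it may underestimate.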
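   
   To make the two heuristics concrete, here is a minimal Python sketch (not 
DataFusion code). The function name `group_count_bounds` and the toy table are 
hypothetical, and capping both estimates at the input row count is my own 
assumption, not something this PR necessarily does.

```python
from math import prod


def group_count_bounds(ndvs, num_rows):
    """Return (lower, upper) estimates for the number of GROUP BY groups.

    ndvs     -- per-column distinct-value counts (at least one column)
    num_rows -- input row count; used as a cap (an assumption of this sketch)
    """
    if not ndvs:
        raise ValueError("expected at least one grouping column")
    # Heuristic 2: full dependency, the largest single-column NDV.
    lower = min(max(ndvs), num_rows)
    # Heuristic 1: independence, the product of the per-column NDVs.
    upper = min(prod(ndvs), num_rows)
    return lower, upper


# Toy table with partially correlated columns: a = i % 4, b = i % 6.
rows = [(i % 4, i % 6) for i in range(24)]
ndv_a = len({a for a, _ in rows})
ndv_b = len({b for _, b in rows})
actual_groups = len(set(rows))

lower, upper = group_count_bounds([ndv_a, ndv_b], len(rows))
print(f"lower={lower} actual={actual_groups} upper={upper}")
```

   On this toy table the real group count falls strictly between the two 
estimates, so either heuristic alone is off, just in a known direction.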
   
   I imagine a reference system might consistently favor one direction (e.g., 
always overestimate) and rely on other mechanisms (such as safeguards in the 
executor) to avoid bad plans.
   
   If we can identify such a reference system, that would be ideal. Otherwise, 
starting with simple heuristics seems like a reasonable first step. One thing 
I’d like to avoid is mixing ideas from multiple systems without a consistent 
set of assumptions, as they may be based on fundamentally different design 
choices.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

