asolimando commented on PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4574439406

   @2010YOUY01, thank you for the thoughtful analysis; the distribution 
perspective is indeed a valuable one that I hadn't considered yet.
   
   I see the distribution assumption differently though: at the base table 
level the underlying distribution can be known precisely (e.g., from KLL 
sketches or histograms), but after the first layer of propagation it's already 
an approximation. I think the distribution belongs in the API request rather 
than as metadata propagated through the tree. Different consumers want 
different assumptions for the same expression, e.g., sizing a hash table to 
avoid OOMs wants the worst case (uniform, highest NDV), while join ordering 
might want a more conservative estimate. This is also why I think Stats V2 had 
the right intuition but the wrong model: propagating distribution objects 
treats an assumption as a fact.
   
   In the framework I had in mind, different distribution assumptions would be 
different provider implementations. But since different callers want different 
assumptions for the same expression, the hint should be per-call. I would 
model this as an extra optional parameter, something like:
   
   ```rust
   fn estimate_distinct_count(
       &self,
       input_stats: &Statistics,
       ctx: &StatisticsContext,
       hint: Option<&DistributionHint>,  // None = uniform
   ) -> Option<usize> { None }
   ```
   
   This also reinforces keeping methods separate: different consumers ask for 
different stats under different assumptions, and a unified `analyze()` would 
conflate these independent concerns.
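   
   To make the per-call idea concrete, here is a minimal, self-contained 
sketch. Everything in it is hypothetical and for illustration only: 
`ColumnStats` is a stand-in rather than DataFusion's `Statistics`, the 
`DistributionHint` variants are placeholder names, and the formulas are 
simplified assumptions rather than a proposed estimator. It only shows how 
`None` could default to uniform while a caller opts into another assumption:
   
   ```rust
   /// Hypothetical per-call hint; not an existing DataFusion type.
   #[derive(Debug, Clone, Copy, PartialEq)]
   pub enum DistributionHint {
       /// Assumed upper bound: every distinct value may survive the filter.
       Uniform,
       /// Assumed scaling: distinct values shrink with the rows kept.
       Proportional,
   }
   
   /// Stand-in for the input statistics (not DataFusion's `Statistics`).
   #[derive(Debug, Clone, Copy)]
   pub struct ColumnStats {
       pub num_rows: usize,
       pub distinct_count: usize,
   }
   
   /// Estimate the NDV of a column after a filter keeping `selectivity` of rows.
   /// `hint == None` falls back to the uniform assumption.
   pub fn estimate_distinct_count(
       input: &ColumnStats,
       selectivity: f64,
       hint: Option<&DistributionHint>,
   ) -> Option<usize> {
       // Reject out-of-range (or NaN) selectivities instead of guessing.
       if !(0.0..=1.0).contains(&selectivity) {
           return None;
       }
       // The output can never hold more distinct values than rows.
       let rows_out = (input.num_rows as f64 * selectivity).ceil() as usize;
       let ndv = match hint.copied().unwrap_or(DistributionHint::Uniform) {
           DistributionHint::Uniform => input.distinct_count.min(rows_out),
           DistributionHint::Proportional => {
               let scaled = (input.distinct_count as f64 * selectivity).ceil() as usize;
               scaled.min(rows_out)
           }
       };
       Some(ndv)
   }
   ```
   
   With 1000 rows, 100 distinct values, and a selectivity of 0.25, a caller 
sizing a hash table would pass `None` and get 100, while a caller passing 
`Some(&DistributionHint::Proportional)` would get 25 from the same input stats.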
   
   Regarding lazy computation of stats more broadly, I think #22300 is the 
right way forward: it introduces granular statistics requests at the 
`TableProvider` level. It's related, but at a different layer (scan-level vs. 
expression-level). I will track this under the same epic (#21120).
   
   Does this match your understanding? Happy to continue this discussion in the 
`PhysicalExpr` extension PR, once #21815 lands.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

