2010YOUY01 commented on PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4575896814

   I still have some questions. Suppose we have separate APIs: in order to 
implement each API accurately, the intermediate representation propagated 
through the expression tree must be a fairly complex model. (Maybe "data 
synopsis" is a better term here; I used "distribution summary" before.)
   
   For example,
   
   ```rust
   get_ndv(a + b)
   
   a: UniformDistribution(min = 0, max = 100, ndv = 100)
   b: UniformDistribution(min = 200, max = 300, ndv = 10)
   ```
   
   Then we can get `ndv(a + b) = 200` accurately, since the sum lies in 
[200, 400] and the range width caps the NDV. If we also want to calculate 
`get_range(a + b)`, the same data synopsis would need to be propagated again, 
so the separate-API approach might lead to duplicated implementations, and a 
unified approach seems simpler here.
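
   To make the duplication concern concrete, here is a minimal sketch of one propagation step feeding both queries. Everything in it is hypothetical: `UniformSynopsis`, its `add` method, and the rule "NDV of a sum = product of input NDVs, capped by `max - min`" are assumptions chosen to reproduce the numbers above, not existing DataFusion APIs.

```rust
/// Hypothetical data synopsis for an integer-valued expression:
/// a closed range [min, max] plus an NDV estimate.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UniformSynopsis {
    min: i64,
    max: i64,
    ndv: u64,
}

impl UniformSynopsis {
    /// Propagate the synopsis through `self + other`.
    /// Assumption: the inputs are independent, so the NDV is at most the
    /// product of the input NDVs, and it is capped by the output range
    /// width (`max - min`), mirroring `ndv = 100` on `[0, 100]` above.
    /// Overflow of the range bounds is ignored in this sketch.
    fn add(self, other: Self) -> Self {
        let min = self.min + other.min;
        let max = self.max + other.max;
        let width = (max - min) as u64;
        let ndv = self.ndv.saturating_mul(other.ndv).min(width);
        Self { min, max, ndv }
    }
}

/// Both "APIs" read from the same propagated synopsis.
fn get_ndv(s: &UniformSynopsis) -> u64 {
    s.ndv
}

fn get_range(s: &UniformSynopsis) -> (i64, i64) {
    (s.min, s.max)
}
```

   With `a` and `b` as above, `a.add(b)` is computed once, and `get_ndv` returns 200 while `get_range` returns `(200, 400)`. With fully separate APIs, each one would need its own copy of the logic in `add`.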
   
   > I see the distribution assumption differently though: at the base table 
level the underlying distribution can be known precisely (e.g., from KLL 
sketches or histograms), but after the first layer of propagation it's already 
an approximation. I think the distribution belongs in the API request rather 
than as metadata propagated through the tree. Different consumers want 
different assumptions for the same expression, e.g., sizing a hash table to 
avoid OOMs wants the worst case (uniform, highest NDV), while join ordering 
might want a more conservative estimate. This is also why I think Stats V2 had 
the right intuition but the wrong model, propagating distribution objects 
treats an assumption as a fact.
   
   I agree that different use cases require different behaviors from 
statistics propagation. This looks like something that could be implemented as 
a hint argument to guide the propagation strategy.
   That's also a solid point: some complex expressions can make the existing 
statistics invalid or unreliable. I think this requires some fallback 
data-synopsis representation? I'm not sure a separate-API design would make 
that easier.
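
   As a rough illustration of the hint idea, the sketch below passes the consumer's preference into the propagation step instead of baking it into the propagated data. The names `PropagationHint`, `ColumnStats`, and `propagate_add` are made up for this example, and both NDV formulas are placeholder heuristics rather than a proposal.

```rust
/// Hypothetical hint describing which estimate the consumer wants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PropagationHint {
    /// Highest plausible NDV, e.g. for sizing a hash table.
    WorstCase,
    /// A lower, more cautious NDV, e.g. for join ordering.
    Conservative,
}

/// Hypothetical per-expression statistics: closed range plus NDV.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColumnStats {
    min: i64,
    max: i64,
    ndv: u64,
}

/// Propagate statistics through `a + b`, letting the hint pick the NDV rule.
/// The range is computed the same way for every hint; only the NDV differs.
/// Overflow of the range bounds is ignored in this sketch.
fn propagate_add(a: ColumnStats, b: ColumnStats, hint: PropagationHint) -> ColumnStats {
    let min = a.min + b.min;
    let max = a.max + b.max;
    let width = (max - min) as u64;
    let ndv = match hint {
        // Assumed rule: product of the input NDVs, capped by the range width.
        PropagationHint::WorstCase => a.ndv.saturating_mul(b.ndv).min(width),
        // Assumed rule: the larger input NDV, capped by the range width.
        PropagationHint::Conservative => a.ndv.max(b.ndv).min(width),
    };
    ColumnStats { min, max, ndv }
}
```

   For the `a` and `b` from the earlier example, `WorstCase` gives an NDV of 200 and `Conservative` gives 100, while the range is `[200, 400]` in both cases.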
   
   Note that these are still just my intuitions, and I could be missing 
something. This is really interesting, and I'll think about it more.
   
   Perhaps we can add some concrete examples to the writeup (a dataset, plus 
the SQL query subplan or expression whose statistics we want to estimate), and 
then reason about the design end-to-end. Right now, it feels like we're jumping 
directly into implementation details, which might make the design decisions 
unclear to me and others.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
