ozankabak commented on PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2673204321

   @edmondop, maybe I can offer some clarification here. What we want is a 
computational framework that gives us how statistical quantities transform 
under functions defined by expressions. Once we have the machinery that does 
this, we can build all sorts of layers on top of it for answering column-level 
and table-level statistical questions.
   
   So how do we go about doing this? There are four cases in "forward" mode:
   1. Statistical quantity with a *known/estimated* distribution ----> 
expression ----> New statistical quantity with a *known/estimated* distribution.
   2. Statistical quantity with a *known/estimated* distribution ----> 
expression ----> New statistical quantity with an *unknown* distribution.
   3. Statistical quantity with an *unknown* distribution ----> expression 
----> New statistical quantity with an *unknown* distribution.
   4. Statistical quantity with an *unknown* distribution ----> expression 
----> New statistical quantity with a *known/estimated* distribution.
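
   To make the forward cases concrete, here is a minimal sketch in Rust. The 
`Dist` enum and the `scale`/`square` functions are hypothetical names made up 
for illustration, not DataFusion's actual API, and the sketch ignores NULLs. 
Each match arm is labeled with the input/output combination it demonstrates.

```rust
/// Hypothetical model of a statistical quantity (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dist {
    /// A known (or estimated) parametric distribution.
    Gaussian { mean: f64, variance: f64 },
    /// A known point mass: the quantity always takes this value.
    Constant(f64),
    /// An unknown distribution, described only by summary statistics.
    Unknown { mean: Option<f64>, variance: Option<f64> },
}

/// Forward propagation through the expression `k * x`.
fn scale(input: Dist, k: f64) -> Dist {
    match input {
        // Unknown (or anything) in, known out: multiplying by zero pins the
        // result to a point mass. This is the rare, special-expression case.
        _ if k == 0.0 => Dist::Constant(0.0),
        // Known in, known out: a scaled Gaussian is still Gaussian.
        Dist::Gaussian { mean, variance } => Dist::Gaussian {
            mean: k * mean,
            variance: k * k * variance,
        },
        // Known in, known out.
        Dist::Constant(c) => Dist::Constant(k * c),
        // Unknown in, unknown out: only the summary statistics transform.
        Dist::Unknown { mean, variance } => Dist::Unknown {
            mean: mean.map(|m| k * m),
            variance: variance.map(|v| k * k * v),
        },
    }
}

/// Forward propagation through the expression `x * x`.
fn square(input: Dist) -> Dist {
    match input {
        // Known in, known out.
        Dist::Constant(c) => Dist::Constant(c * c),
        // Known in, unknown out: the square of a Gaussian is not Gaussian,
        // but its mean follows from E[X^2] = Var[X] + E[X]^2.
        // The variance is left unset in this sketch.
        Dist::Gaussian { mean, variance } => Dist::Unknown {
            mean: Some(variance + mean * mean),
            variance: None,
        },
        // Unknown in, unknown out.
        Dist::Unknown { mean, variance } => Dist::Unknown {
            mean: match (mean, variance) {
                (Some(m), Some(v)) => Some(v + m * m),
                _ => None,
            },
            variance: None,
        },
    }
}
```

   For example, `square(Dist::Gaussian { mean: 1.0, variance: 2.0 })` loses 
the parametric form but keeps a mean of `3.0`, while scaling any input by 
`0.0` yields `Dist::Constant(0.0)`.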
   
   Cases 1, 2 and 3 are quite common. Case 4 happens rarely with special types 
of expressions. There is also the "reverse" mode, where we have information 
about the statistics of the result (e.g. when a filter forces a composite 
expression to be true). This enables us to update our information about the 
distributions of constituent expressions by recursively applying Bayes' rule.
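
   A single step of the "reverse" mode can be sketched as follows, again in 
Rust with hypothetical names (`Uniform`, `prob_greater_than`, 
`condition_on_greater_than`) that are not part of DataFusion. It assumes a 
uniform prior with `lo < hi` and a filter of the form `x > c` that is known to 
be true. By Bayes' rule the posterior is the prior restricted to the region 
where the filter holds and renormalized, which for a uniform prior is simply a 
narrower uniform distribution.

```rust
/// Hypothetical uniform distribution over [lo, hi], assuming lo < hi.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Uniform {
    lo: f64,
    hi: f64,
}

/// Probability that `x > c` holds under the prior (the filter's selectivity).
fn prob_greater_than(prior: Uniform, c: f64) -> f64 {
    // Clamp the cut point into the support so the result stays in [0, 1].
    let cut = c.clamp(prior.lo, prior.hi);
    (prior.hi - cut) / (prior.hi - prior.lo)
}

/// Posterior of `x` given that the filter `x > c` evaluated to true.
/// Returns `None` when the evidence has zero probability under the prior.
fn condition_on_greater_than(prior: Uniform, c: f64) -> Option<Uniform> {
    if c >= prior.hi {
        return None;
    }
    // Restricting a uniform prior to x > c and renormalizing gives a
    // uniform distribution on the surviving part of the support.
    Some(Uniform {
        lo: prior.lo.max(c),
        hi: prior.hi,
    })
}
```

   A full implementation would apply this kind of update recursively to the 
children of a composite expression; the sketch only covers one leaf.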
   
   With this general explanation out of the way, let's go back to the specifics 
of your question. In this light, your question about histograms basically boils 
down to how we represent an unknown distribution. In the initial 
implementation, we represent it using various summary statistics. If this turns 
out to be insufficient, we can add an attribute to the unknown distribution 
variant of the enum to store histogram information as well. If we do this, the 
entire machinery will stay the same -- we will only need to update the 
encapsulated code that handles how unknown distributions are updated. So it 
would actually be a small-ish PR to do this 🙂 
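
   As a rough illustration of why the change would stay local, here is a 
sketch in Rust. All names (`Dist`, `Histogram`, `negate`, `negate_histogram`) 
are hypothetical and not taken from DataFusion. The histogram is an optional 
attribute on the unknown variant, and only the arm that updates unknown 
distributions needs to know about it.

```rust
/// Hypothetical equi-arbitrary-width histogram:
/// `edges.len() == counts.len() + 1`, edges sorted ascending.
#[derive(Debug, Clone, PartialEq)]
struct Histogram {
    edges: Vec<f64>,
    counts: Vec<u64>,
}

/// Hypothetical distribution enum (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Dist {
    Gaussian { mean: f64, variance: f64 },
    Unknown {
        mean: Option<f64>,
        variance: Option<f64>,
        /// Extra attribute; `None` reproduces the summary-statistics-only setup.
        histogram: Option<Histogram>,
    },
}

/// Forward propagation through the expression `-x`.
fn negate(input: &Dist) -> Dist {
    match input {
        // The known-distribution arm is untouched by the histogram addition.
        Dist::Gaussian { mean, variance } => Dist::Gaussian {
            mean: -mean,
            variance: *variance,
        },
        // Only this arm handles the new attribute.
        Dist::Unknown { mean, variance, histogram } => Dist::Unknown {
            mean: mean.map(|m| -m),
            variance: *variance,
            histogram: histogram.as_ref().map(negate_histogram),
        },
    }
}

/// Negating the values mirrors the histogram: bucket edges flip sign and
/// both edges and counts reverse order so the edges stay ascending.
fn negate_histogram(h: &Histogram) -> Histogram {
    Histogram {
        edges: h.edges.iter().rev().map(|e| -e).collect(),
        counts: h.counts.iter().rev().copied().collect(),
    }
}
```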
   
   I hope this helps. Thanks for helping with reviewing 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

