ozankabak commented on PR #14699: URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2673204321
@edmondop, maybe I can offer some clarification here.

What we want is a computational framework that tells us how statistical quantities transform under functions defined by expressions. Once we have the machinery that does this, we can build all sorts of layers on top of it for answering column-level and table-level statistical questions.

So how do we go about doing this? There are four cases in "forward" mode:

1. Statistical quantity with a *known/estimated* distribution ----> expression ----> New statistical quantity with a *known/estimated* distribution.
2. Statistical quantity with a *known/estimated* distribution ----> expression ----> New statistical quantity with an *unknown* distribution.
3. Statistical quantity with an *unknown* distribution ----> expression ----> New statistical quantity with an *unknown* distribution.
4. Statistical quantity with an *unknown* distribution ----> expression ----> New statistical quantity with a *known/estimated* distribution.

Cases 1, 2 and 3 are quite common. Case 4 happens rarely, with special types of expressions.

There is also the "reverse" mode, where we have information about the statistics of the result (e.g. when we have a filter that forces a composite expression to be true). This enables us to update our information about the distributions of the constituent expressions by recursively applying Bayes' rule.

With this general explanation out of the way, let's go back to the specifics of your question. In this light, your question about histograms basically boils down to how we represent an unknown distribution. In the initial implementation, we represent it using various summary statistics. If this turns out to be insufficient, we can add an attribute to the unknown distribution variant of the enum to store histogram information as well. If we do this, the entire machinery will stay the same; we will only need to update the encapsulated code that handles how unknown distributions are updated.
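To make the first three forward cases a bit more concrete, here is a minimal Rust sketch. It is only an illustration under stated assumptions: the names `Dist`, `Summary`, `add` and `square` are hypothetical and are not the types or functions in this PR, operands are assumed independent, and a Gaussian stands in for "a known distribution".

```rust
// Hypothetical types for illustration only; not DataFusion's actual API.

/// Summary statistics kept when the shape of a distribution is unknown.
#[derive(Debug, Clone, PartialEq)]
struct Summary {
    mean: Option<f64>,
    variance: Option<f64>,
    // A histogram field could be added here later; only the code that
    // builds and combines `Summary` values would need to change.
}

#[derive(Debug, Clone, PartialEq)]
enum Dist {
    Gaussian { mean: f64, variance: f64 },
    Unknown(Summary),
}

impl Dist {
    fn mean(&self) -> Option<f64> {
        match self {
            Dist::Gaussian { mean, .. } => Some(*mean),
            Dist::Unknown(s) => s.mean,
        }
    }

    fn variance(&self) -> Option<f64> {
        match self {
            Dist::Gaussian { variance, .. } => Some(*variance),
            Dist::Unknown(s) => s.variance,
        }
    }
}

/// Forward propagation through `a + b`, assuming independent operands.
fn add(a: &Dist, b: &Dist) -> Dist {
    match (a, b) {
        // Case 1: known in, known out (Gaussians are closed under addition).
        (
            Dist::Gaussian { mean: m1, variance: v1 },
            Dist::Gaussian { mean: m2, variance: v2 },
        ) => Dist::Gaussian { mean: m1 + m2, variance: v1 + v2 },
        // Case 3: an unknown operand gives an unknown result; only the
        // summary statistics that are available get propagated.
        _ => Dist::Unknown(Summary {
            mean: a.mean().zip(b.mean()).map(|(x, y)| x + y),
            variance: a.variance().zip(b.variance()).map(|(x, y)| x + y),
        }),
    }
}

/// Forward propagation through `a * a`.
fn square(a: &Dist) -> Dist {
    // Case 2 (or 3): the square of a Gaussian is not Gaussian, so the
    // result is always unknown. E[X^2] = Var(X) + E[X]^2 holds in general.
    let mean = a.mean().zip(a.variance()).map(|(m, v)| v + m * m);
    // Var(X^2) needs fourth moments; for a Gaussian it has the closed
    // form 2*sigma^4 + 4*mu^2*sigma^2. Otherwise it stays unavailable.
    let variance = match a {
        Dist::Gaussian { mean: m, variance: v } => Some(2.0 * v * v + 4.0 * m * m * v),
        Dist::Unknown(_) => None,
    };
    Dist::Unknown(Summary { mean, variance })
}
```

In this sketch, storing a histogram would only touch `Summary` and the two places that construct it; the `match` structure that routes between the cases would stay as it is.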
So it would actually be a small-ish PR to do this 🙂

I hope this helps. Thanks for helping with reviewing 🚀

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org