Re: [PR] StatisticsV2: initial statistics framework redesign [datafusion]

via GitHub Sun, 23 Feb 2025 06:01:48 -0800


xudong963 commented on PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2676878054


   > Specifically, if we let `X_i` denote the value of `i`th row of column `X`, 
the maximum value for the column would be `M = max(X_1, ..., X_N)` with `N` 
being the number of rows. Given probabilistic information on the possible 
values of an arbitrary `X_i`, we can also make a probabilistic guess on what 
`M` can be.
   
   This makes a lot of sense, thanks for your clear explanation. Now I 
understand how the distribution works and the difference between the current 
statistics model and the original Min/max/nv.
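
As a rough sketch of the quoted idea (not code from this PR): if the `X_i` are assumed independent and identically distributed with CDF `F`, then `P(M <= m) = F(m)^N`. The Python below works this out for a column assumed uniform on `[lo, hi]`. The i.i.d. and uniform assumptions, and the names `uniform_cdf`, `max_cdf`, and `expected_max`, are made up for illustration only.

```python
def uniform_cdf(x, lo, hi):
    """P(X_i <= x) for a single value assumed uniform on [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def max_cdf(m, lo, hi, n):
    """P(M <= m) for M = max(X_1, ..., X_n), assuming i.i.d. rows."""
    return uniform_cdf(m, lo, hi) ** n


def expected_max(lo, hi, n):
    """E[M] for n i.i.d. uniform values on [lo, hi]: lo + (hi - lo) * n / (n + 1)."""
    return lo + (hi - lo) * n / (n + 1)


if __name__ == "__main__":
    lo, hi, n = 0.0, 100.0, 1000
    print(f"P(M <= 99) = {max_cdf(99.0, lo, hi, n):.6f}")
    print(f"E[M] = {expected_max(lo, hi, n):.3f}")
```

Under these assumptions the maximum concentrates near `hi` as `N` grows, so the distribution of `M` is much tighter than the distribution of any single `X_i`.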
   
   
   > There is no reason why we can't use statistical tests to "recognize" 
distributions and use recognized distributions instead of directly falling back 
to unknown distributions in such cases.
   
   Yes, `Sample` is very valuable when statistics are lacking. In my 
experience building an entire optimizer, the annoying problem is that 
statistics are often inaccurate because the data grows frequently, or missing 
because the data is unstructured. In that context, `Sample` will show its muscle.
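
One way such "recognition" could work on a sample is a goodness-of-fit check. The sketch below computes the one-sample Kolmogorov-Smirnov statistic against a hypothesized CDF in plain Python. It is only an illustration, not DataFusion's API: `ks_statistic`, `looks_like`, and the `1.36 / sqrt(n)` threshold (the usual large-sample 5% critical value) are assumptions for the example.

```python
import math


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def looks_like(sample, cdf):
    """True if the sample is not rejected at roughly the 5% level."""
    return ks_statistic(sample, cdf) <= 1.36 / math.sqrt(len(sample))


def unit_uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


if __name__ == "__main__":
    evenly_spread = [0.1, 0.3, 0.5, 0.7, 0.9]
    skewed = [0.95, 0.96, 0.97, 0.98, 0.99]
    print(looks_like(evenly_spread, unit_uniform_cdf))
    print(looks_like(skewed, unit_uniform_cdf))
```

If no candidate distribution passes such a check, falling back to an unknown distribution remains the safe default.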
   
   > In the worst case, all the calculus will work through unknown 
distributions and we will not be in a worse position than where we were before 
(sans bugs)
   
   Makes sense. With StatisticsV2, the worst case is falling back to the 
original behavior.
   
   > It may be an interesting idea to write something up once we finalize all 
the details.
   
   Thanks, looking forward to it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

