[PR] [Improvement](statistics)Improve stats sample strategy (#26435) [doris]

via GitHub Mon, 13 Nov 2023 01:47:27 -0800


Jibing-Li opened a new pull request, #26890:
URL: https://github.com/apache/doris/pull/26890


   backport https://github.com/apache/doris/pull/26435
   
   Improve the accuracy of sampled statistics collection. For non-distribution 
columns, estimate NDV as `n*d / (n - f1 + f1*n/N)`,
   
   where `f1` is the number of distinct values that occur exactly once in 
the sample of `n` rows (out of `N` total rows), and `d` is the total number of 
distinct values in the sample.
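   
   As a rough illustration of the formula above (not the code in this PR), 
the estimator can be sketched in Python. The function name `estimate_ndv` and 
its parameters are hypothetical and exist only for this example; the sample is 
assumed to be an in-memory list of column values.

```python
from collections import Counter


def estimate_ndv(sample, total_rows):
    """Estimate the number of distinct values (NDV) in a column.

    Hypothetical helper for illustration only; it applies
    n*d / (n - f1 + f1*n/N) to an in-memory sample.
    """
    n = len(sample)              # n: rows in the sample
    if n == 0 or total_rows <= 0:
        return 0.0
    counts = Counter(sample)
    d = len(counts)              # d: distinct values seen in the sample
    # f1: distinct values that occur exactly once in the sample
    f1 = sum(1 for c in counts.values() if c == 1)
    denominator = n - f1 + f1 * n / total_rows
    return n * d / denominator


# 4 sampled rows out of 8: d = 3, f1 = 2, so 4*3 / (4 - 2 + 2*4/8) = 4.0
print(estimate_ndv([1, 1, 2, 3], total_rows=8))
```

   Two properties follow from the formula: when the sample covers the whole 
table (`n == N`) the estimate reduces to `d`, and when every sampled value is 
unique (`f1 == n`) the estimate scales up to `N`.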
   
   For distribution columns, use `ndv(n) * fraction of tablets sampled` for NDV.
   
   When a sampled tablet is very large, use a limit to cap the total number of 
rows scanned (non-key columns only: key columns are stored sorted, so an 
estimate taken from a limited scan would be inaccurate).
   
   ## Proposed changes
   
   Issue Number: close #xxx
   
   <!--Describe your changes.-->
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at 
[[email protected]](mailto:[email protected]) by explaining why you 
chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


