Re: POC: GROUP BY optimization

Andrey Lepikhov Fri, 21 Jan 2022 03:09:31 -0800

On 7/22/21 3:58 AM, Tomas Vondra wrote:

4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to
estimate_num_groups_incremental. In principle yes, if we use "group
size" from the previous step, then the returned value is the number of
new groups after adding the "new" pathkey.
But even if we ignore the issues with amplification mentioned in (3),
there's an issue with non-linear behavior in estimate_num_groups,
because at some point it's calculating


    D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))

where N - total rows, p - sample size, n - number of distinct values.
And if we have (N1,n1) and (N2,n2) then the ratio of calculated
estimated (which is pretty much what calculating group size does)

    D(N2,n2,p2) / D(N1,n1,p1)

which will differ depending on p1 and p2. And if we're tweaking the
tuplesPerPrevGroup all the time, that's really annoying, as it may make
the groups smaller or larger, which is unpredictable and annoying, and I
wonder if it might go against the idea of penalizing tuplesPerPrevGroup
to some extent.
We could simply use the input "tuples" value here, and then divide the
current and previous estimate to calculate the number of new groups.

tuplesPerPrevGroup is only a top boundary for the estimation. I think,we should use result of previous estimation as a limit for the nextduring incremental estimation. Maybe we simply limit thetuplesPerPrevGroup value to ensure of monotonic non-growth of thisvalue? - see in attachment patch to previous fixes.


--
regards,
Andrey Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index 70af9c91d5..4e26cd275d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2025,8 +2025,10 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, 
int nPresortedKeys,
                /*
                 * Real-world distribution isn't uniform but now we don't have 
a way to
                 * determine that, so, add multiplier to get closer to worst 
case.
+                * Guard this value against irrational decisions.
                 */
-               tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups);
+               tuplesPerPrevGroup = Min(tuplesPerPrevGroup,
+                                                                ceil(1.5 * 
tuplesPerPrevGroup / nGroups));
 
                /*
                 * We could skip all following columns for cost estimation, 
because we

Re: POC: GROUP BY optimization

Reply via email to