Re: POC: GROUP BY optimization

Andrei Lepikhov Thu, 18 Jan 2024 01:49:21 -0800

On 16/1/2024 22:05, Alexander Korotkov wrote:

On Tue, Jan 16, 2024 at 4:48 AM Richard Guo <guofengli...@gmail.com> wrote:

* When trying to match the ordering of GROUP BY to that of ORDER BY in
get_useful_group_keys_orderings, this patch checks against the length of
path->pathkeys.  This does not make sense.  I guess this is a typo and
the real intention is to check against root->sort_pathkeys.

Thanks! It is really my blunder - fresh look works.


--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path 
*path)
                                            root->num_groupby_pathkeys);

         if (n > 0 &&
-           (enable_incremental_sort || n == list_length(path->pathkeys)))
+           (enable_incremental_sort || n == list_length(root->sort_pathkeys)))

However, I think this is also incorrect.  When incremental sort is
disabled, it is reasonable to consider reordering the GROUP BY keys only
if the number of matching pathkeys is equal to the length of
root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.


Hmm... Why should this be list_length(root->group_pathkeys) while
we're targeting root->sort_pathkeys.  I yet changed that to
list_length(root->sort_pathkeys).

I think, in the first case, when we are trying to arrange group-by keysaccording to underlying pathkeys with incremental sort = off, it makessense to do if we fetch all group-by keys regardless of a more or equalnumber of path keys the incoming path contains. The code and test caseare in step1.txt.

* When the original ordering of GROUP BY keys matches the path's
pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
generate duplicate PathKeyInfos.  Consequently, this duplication would
lead to the creation and addition of identical paths multiple times.
This is not great.  Consider the query below:

create table t (a int, b int);
create index on t (a, b);
set enable_hashagg to off;

explain select count(*) from t group by a, b;
                                     QUERY PLAN
----------------------------------------------------------------------------------
  GroupAggregate  (cost=0.15..97.27 rows=226 width=16)
    Group Key: a, b
    ->  Index Only Scan using t_a_b_idx on t  (cost=0.15..78.06 rows=2260 
width=8)
(3 rows)

The same path with group keys {a, b} is created and added twice.


I tried to address that.

* Part of the work performed in this patch overlaps with that of
preprocess_groupclause.  They are both trying to adjust the ordering of
the GROUP BY keys to match ORDER BY.  I wonder if it would be better to
perform this work only once.


Andrei, could you take a look.

As I see, the PathKeyInfo list often contains duplicated pathkeys,coming from either sort_pathkeys or path->pathkeys orderings. So, I canpropose to check duplicates each time (see step2.txt in attachment).

* When reordering the GROUP BY keys, I think the following checks need
some improvements.

+       /*
+        * Since 1349d27 pathkey coming from underlying node can be in the
+        * root->group_pathkeys but not in the processed_groupClause. So, we
+        * should be careful here.
+        */
+       sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
+                                           *group_clauses);
+       if (!sgc)
+           /* The grouping clause does not cover this pathkey */
+           break;

I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
valid before calling get_sortgroupref_clause_noerr with it.  Also, how
can we guarantee that the returned GROUP BY item is sortable?  Should we
check that with OidIsValid(sgc->sortop)?


Added.

Reviewed it, looks good.

--
regards,
Andrei Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/pathkeys.c 
b/src/backend/optimizer/path/pathkeys.c
index 919d54dd79..1095b73dac 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -489,8 +489,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path 
*path)
                n = group_keys_reorder_by_pathkeys(path->pathkeys, &pathkeys, 
&clauses,
                                                                                
   root->num_groupby_pathkeys);
 
-               if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) 
!= PATHKEYS_EQUAL &&
-                       (enable_incremental_sort || n == 
list_length(path->pathkeys)))
+               if (n > 0 &&
+                       (enable_incremental_sort || n == 
root->num_groupby_pathkeys) &&
+                       compare_pathkeys(pathkeys, root->group_pathkeys) != 
PATHKEYS_EQUAL)
                {
                        info = makeNode(PathKeyInfo);
                        info->pathkeys = pathkeys;
diff --git a/src/test/regress/expected/aggregates.out 
b/src/test/regress/expected/aggregates.out
index ca38e78f21..67dd20f375 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2856,6 +2856,23 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z 
ORDER BY x*x,z;
                ->  Index Scan using abc on btg
 (8 rows)
 
+SET enable_incremental_sort = off;
+-- The case when the number of incoming subtree path keys is more than
+-- the number of grouping keys.
+CREATE INDEX idx_y_x_z ON btg(y,x,w);
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ GroupAggregate
+   Output: y, x, array_agg(DISTINCT w)
+   Group Key: btg.y, btg.x
+   ->  Index Only Scan using idx_y_x_z on public.btg
+         Output: y, x, w
+         Index Cond: (btg.y < 0)
+(6 rows)
+
+RESET enable_incremental_sort;
 DROP TABLE btg;
 -- The case, when scanning sort order correspond to aggregate sort order but
 -- can not be found in the group-by list
diff --git a/src/test/regress/sql/aggregates.sql 
b/src/test/regress/sql/aggregates.sql
index 9880a188d3..524bdfa67d 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1182,7 +1182,6 @@ SELECT balk(hundred) FROM tenk1;
 ROLLBACK;
 
 -- GROUP BY optimization by reorder columns
-
 CREATE TABLE btg AS SELECT
   i % 100 AS x,
   i % 100 AS y,
@@ -1223,8 +1222,11 @@ explain (COSTS OFF) SELECT x,y FROM btg GROUP BY w,x,z,y 
ORDER BY y,x,z,w;
 explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z;
 
 SET enable_incremental_sort = off;
-EXPLAIN (VERBOSE, COSTS ON)
-SELECT x,y, array_agg(z) FROM btg WHERE x < 2 GROUP BY y,x;
+-- The case when the number of incoming subtree path keys is more than
+-- the number of grouping keys.
+CREATE INDEX idx_y_x_z ON btg(y,x,w);
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y;
 RESET enable_incremental_sort;
 
 DROP TABLE btg;

Re: POC: GROUP BY optimization

Reply via email to