[HACKERS] join removal

Robert Haas Sun, 18 Jan 2009 19:56:58 -0800

Apologies for submitting new patches while we're still in the midst of
CommitFest:November, but I think I've more or less come to the end of
the reviewing that I can usefully do for 8.4 (or at least, I haven't
had any requests lately, feel free...) and I don't want to wait
forever to start thinking about 8.5.


This patch is based on Simon Riggs' earlier work on join removal,
although I don't believe I've actually used any of his code.  Tom
Lane's review comments on Simon's code were invaluable in getting this
to work.  I also understand that Simon plans to do more work in this
area, so if this ends up getting sucked into his work or visca versa
(or this ends up getting superseded altogether), that is fine.

http://archives.postgresql.org/pgsql-patches/2008-08/msg00035.php
http://archives.postgresql.org/pgsql-patches/2008-09/msg00024.php

The theoretical underpinning of this patch is as follows: Joins can
affect the result of a query in one of two ways: either they change
the set of rows returned (by either increasing it or decreasing it),
or they provide attributes.  To remove a join, we need to prove that
neither of these things is possible.  It's pretty easy to test that
the join isn't providing any attributes above the level of the join,
and you'll see that rel_attrs_needed_above_join() handles that here.
Verifying that the number of rows returned isn't changed by the join
is harder.

For the sake of simplicity, this patch only attempts to handle LEFT
JOINs.  Specifically, it attempts to prove that the results of
scanning the outer side of the joinrel will be identical to the
results of performing a merge-join, up to sorting.  Nothing that
happens on the inner side of the join can reduce the number of output
rows (except via a WHERE-clause qual that will fail the
rel_attrs_needed_above_join test anyway), so the only thing we need to
worry about is the possibility that the join might increase the output
row count.  To do that, we need to know that there will be a
sufficiently strong unique-ness property on the rows emitted by the
inner side of the join.  I believe that there are two main cases in
which we could potentially prove this: the inner side of the join is
an index scan (without resort) on a unique index, or the inner side of
the join is a subselect with a group by clause over the join columns.
For now, I'm only attempting to handle the first case, though it might
be nice to eventually handle the other as well.

Simon's original patch starts by checking whether the RelOptInfo for
the relation to be dropped is a base relation, but that approach seems
somewhat limiting, because we certainly want to be able to remove
multiple joins in succession.  Setting the paths for the joinrel {B C}
to the paths for B doesn't help us when we try to form {A B C},
because now the join of {A} to {B C} is not a join against a base rel
and join removal fails.  The other problem is that at that level we
don't really understand what's up with the join clauses: is the join
operator even an equality operator?  There is finally the problem that
while we can get at the relevant IndexOptInfo structures and figure
out which combinations of attributes have unique indices, I can't
figure out any sane way to relate those attribute numbers back to the
join clauses.

So I've taken the alternative approach of working on the level of
paths.  sort_outer_and_inner(), before forming a join path, checks
whether there happens to be a unique index with the same pathkeys as
the inner side of the path.  If so, and if no attributes are needed
from the inner side of the join, the it pulls up all the paths from
the outerrel into the joinrel and exits (really, it should skip the
rest of add_paths_to_joinrel() as well, but it didn't seem worth
refactoring that without some validation of the basic approach).  This
approach is limiting because sort_outer_and_inner() doesn't try every
possible combination of pathkeys that might be helpful to us, but it
seems likely to be adequate here for the same reasons that
sort_outer_and_inner() gets away with it generally: the list of
mergeclauses tends to be real short.

(I think that if we were trying to optimize away an INNER join, this
approach would be inadequate, because any non-merge-joinable join
clauses could change the number of output rows.  Here that can't
happen: the most they can do is force the inner side to NULL, but if
the inner side isn't being used that doesn't matter anyway.  Of
course, for INNER JOINs, we'd also need to worry about the
merge-joinable clauses themselves stripping rows from the output due
to a failure to match; we'd need to check for foreign key
constraints.)

I have a feeling there are probably lingering bugs in here, but it
passes regression tests and some sample queries I tried still return
the correct answers even after nuking one or more joins.

Any comments appreciated,

...Robert

*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 108,113 **** bool		enable_hashagg = true;
--- 108,114 ----
  bool		enable_nestloop = true;
  bool		enable_mergejoin = true;
  bool		enable_hashjoin = true;
+ bool		enable_joinremoval = true;
  
  typedef struct
  {
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
***************
*** 41,46 **** static List *select_mergejoin_clauses(PlannerInfo *root,
--- 41,49 ----
  						 RelOptInfo *innerrel,
  						 List *restrictlist,
  						 JoinType jointype);
+ static bool rel_attrs_needed_above_join(RelOptInfo *rel, RelOptInfo *joinrel);
+ static bool test_join_removal(RelOptInfo *joinrel, RelOptInfo *rel,
+ 	List *inner_pathkeys_for_merge, bool *removejoinOK);
  
  
  /*
***************
*** 158,163 **** sort_inner_and_outer(PlannerInfo *root,
--- 161,167 ----
  					 SpecialJoinInfo *sjinfo)
  {
  	bool		useallclauses;
+ 	bool		removejoinOK = false;
  	Path	   *outer_path;
  	Path	   *inner_path;
  	List	   *all_pathkeys;
***************
*** 169,176 **** sort_inner_and_outer(PlannerInfo *root,
  	 */
  	switch (jointype)
  	{
- 		case JOIN_INNER:
  		case JOIN_LEFT:
  		case JOIN_SEMI:
  		case JOIN_ANTI:
  		case JOIN_UNIQUE_OUTER:
--- 173,181 ----
  	 */
  	switch (jointype)
  	{
  		case JOIN_LEFT:
+ 			removejoinOK = enable_joinremoval;
+ 		case JOIN_INNER:
  		case JOIN_SEMI:
  		case JOIN_ANTI:
  		case JOIN_UNIQUE_OUTER:
***************
*** 276,281 **** sort_inner_and_outer(PlannerInfo *root,
--- 281,295 ----
  												  cur_mergeclauses,
  												  outerkeys);
  
+ 		/* Join removal. */
+ 		if (removejoinOK
+ 			&& test_join_removal(joinrel, innerrel, innerkeys, &removejoinOK))
+ 		{
+ 			add_paths_from_list(joinrel, outerrel->pathlist);
+ 			/* No point in considering any more join paths. */
+ 			return;
+ 		}
+ 
  		/* Build pathkeys representing output sort order */
  		merge_pathkeys = build_join_pathkeys(root, joinrel, jointype,
  											 outerkeys);
***************
*** 1048,1050 **** select_mergejoin_clauses(PlannerInfo *root,
--- 1062,1196 ----
  
  	return result_list;
  }
+ 
+ /*
+  * test_join_removal
+  *    Attempt to prove that joining against the inner relation can't change
+  *    the number of output rows.
+  */
+ static bool
+ test_join_removal(RelOptInfo *joinrel, RelOptInfo *innerrel,
+ 	List *inner_pathkeys_for_merge, bool *removejoinOK)
+ {
+ 	bool		any_hope_left = false;
+ 	bool		useful_index_found = false;
+ 	ListCell	*l;
+ 
+  	/*
+ 	 * We're basically looking for an index-scan path over a unique index on a
+ 	 * suitable set of columns.  In the future, we might want to consider the
+ 	 * case of a LEFT JOIN against a grouped subselect where the grouped
+ 	 * columns are mergejoinable:
+ 	 *   SELECT c.* FROM customer c
+ 	 *     LEFT JOIN (SELECT customer_id, SUM(1) FROM customer_order
+ 	 *			GROUP BY 1) co ON c.id = co.customer_id
+ 	 */
+ 	foreach(l, innerrel->pathlist)
+ 	{
+ 		Path *path = lfirst(l);
+ 		/*
+ 		 * Since we're only handling LEFT joins for now, we don't need to
+ 		 * worry about whether the inner side of the join reduces the number
+ 		 * of output rows.  In general, that can only happen if there is a
+ 		 * qual on the inner rel - but the rel_attrs_needed_above_join test
+ 		 * will catch that case.
+ 		 *
+ 		 * We do need to worry about the join increasing the number of output
+ 		 * rows.  It suffices to find a unique index on all of the pathkeys for
+ 		 * ANY possible way of doing the merge join.  In other words, the
+ 		 * presence of non-merge-joinable join clauses isn't a problem - and
+ 		 * we don't need to use all of the clauses, either.  In such cases,
+ 		 * the unique index we're looking for will be stronger than necessary,
+ 		 * but that doesn't matter (since this is an outer join).
+ 		 *
+ 		 * This algorithm can miss a legal join removal if the set of pathkeys
+ 		 * that corresponds to the ordering of the relevant index is never
+ 		 * tried, but it doesn't seem worth spending too much time worrying
+ 		 * about that case for the same reasons mentioned in
+ 		 * sort_outer_and_inner.
+ 		 */
+ 		if (IsA(path, IndexPath)) {
+ 			IndexOptInfo *info = ((IndexPath *) path)->indexinfo;
+ 			if (!info->unique)
+ 				continue;
+ 			/*
+ 			 * What's going on here is a little bit subtle.  The only possible
+ 			 * source of pathkeys for the index path is the columns of the
+ 			 * index, so we know that the pathkeys are derived from the index
+ 			 * columns.  However, we don't know that they're all present in
+ 			 * the index-path's list of pathkeys, because the planner is free
+ 			 * to truncate the pathkey list if the lower-order terms aren't
+ 			 * useful to the current query.  So we must check that the lists
+ 			 * are of equal length.  (If they aren't, the unique index isn't
+ 			 * "unique enough", because it has lower-order terms.)
+ 			 */
+ 			if (info->ncolumns == list_length(path->pathkeys)
+ 				&& compare_pathkeys(inner_pathkeys_for_merge, path->pathkeys)
+ 					== PATHKEYS_EQUAL)
+ 			{
+ 				useful_index_found = true;
+ 				break;
+ 			}
+ 			/*
+ 			 * Since we found a unique index, there's still hope, even if this
+ 			 * set of pathkeys doesn't pan out.
+ 			 */
+ 			any_hope_left = true;
+ 		}
+ 		/*
+ 		 * In the future, we might want to consider other ways to prove that
+ 		 * the left join can't increase the row count, such as a LEFT JOIN
+ 		 * against a merge-joinable grouped subselect, like this:
+ 		 *   SELECT
+ 		 *     c.*
+ 		 *   FROM
+ 		 *     customer c
+ 		 *     LEFT JOIN (SELECT customer_id, SUM(1) FROM customer_order
+ 		 *				  GROUP BY 1) co
+ 		 *	   ON c.id = co.customer_id
+ 		 */
+ 	}
+ 
+ 	/*
+ 	 * We found a useful unique index.  But we still can't remove the join
+ 	 * unless the rel supplies no attrs.
+ 	 */
+ 	if (useful_index_found)
+ 	{
+ 		if (rel_attrs_needed_above_join(innerrel, joinrel))
+ 		{
+ 			/*
+ 			 * If the innerrel is supplying attributes, join removal fails,
+ 			 * and there's no point in rescanning with a different set of
+ 			 * pathkeys, so kill off any further attempts.
+ 			 */
+ 			*removejoinOK = false;
+ 			return false;
+ 		}
+ 		return true;
+ 	}
+ 
+ 	/* No luck. */
+ 	if (!any_hope_left)
+ 		*removejoinOK = false;
+ 	return false;
+ }
+ 
+ /*
+  * rel_attrs_needed_above_join
+  *     Determine whether any attributes from rel are used outside of joinrel.
+  *
+  * As a micro-optimization, it seems better to start with max_attr and count
+  * down rather than starting with min_attr and counting up, on the theory that
+  * the system attributes are somewhat less likely to be what is wanted and
+  * should be tested last.
+  */
+ static bool
+ rel_attrs_needed_above_join(RelOptInfo *rel, RelOptInfo *joinrel)
+ {
+ 	int attroff;
+ 	for (attroff = rel->max_attr - rel->min_attr; attroff >= 0; --attroff)
+ 		if (!bms_is_subset(rel->attr_needed[attroff], joinrel->relids))
+ 			return true;
+ 	return false;
+ }
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 340,346 **** add_path(RelOptInfo *parent_rel, Path *new_path)
  			/*
  			 * Delete the data pointed-to by the deleted cell, if possible
  			 */
! 			if (!IsA(old_path, IndexPath))
  				pfree(old_path);
  			/* Advance list pointer */
  			if (p1_prev)
--- 340,346 ----
  			/*
  			 * Delete the data pointed-to by the deleted cell, if possible
  			 */
! 			if (!IsA(old_path, IndexPath) && old_path->parent == parent_rel)
  				pfree(old_path);
  			/* Advance list pointer */
  			if (p1_prev)
***************
*** 378,388 **** add_path(RelOptInfo *parent_rel, Path *new_path)
  	else
  	{
  		/* Reject and recycle the new path */
! 		if (!IsA(new_path, IndexPath))
  			pfree(new_path);
  	}
  }
  
  
  /*****************************************************************************
   *		PATH NODE CREATION ROUTINES
--- 378,400 ----
  	else
  	{
  		/* Reject and recycle the new path */
! 		if (!IsA(new_path, IndexPath) && new_path->parent == parent_rel)
  			pfree(new_path);
  	}
  }
  
+ /*
+  * add_paths_from_list
+  *    Add each path from the list to parent_rel.
+  */
+ void
+ add_paths_from_list(RelOptInfo *parent_rel, List *pathlist)
+ {
+ 	ListCell *lc;
+ 
+ 	foreach (lc, pathlist)
+ 		add_path(parent_rel, (Path *) lfirst(lc));
+ }
  
  /*****************************************************************************
   *		PATH NODE CREATION ROUTINES
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 648,653 **** static struct config_bool ConfigureNamesBool[] =
--- 648,661 ----
  		true, NULL, NULL
  	},
  	{
+ 		{"enable_joinremoval", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of join removal."),
+ 			NULL
+ 		},
+ 		&enable_joinremoval,
+ 		true, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashjoin", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hash join plans."),
  			NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 190,195 ****
--- 190,196 ----
  #enable_seqscan = on
  #enable_sort = on
  #enable_tidscan = on
+ #enable_joinremoval = on
  
  # - Planner Cost Constants -
  
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
***************
*** 59,64 **** extern bool enable_hashagg;
--- 59,65 ----
  extern bool enable_nestloop;
  extern bool enable_mergejoin;
  extern bool enable_hashjoin;
+ extern bool enable_joinremoval;
  extern int	constraint_exclusion;
  
  extern double clamp_row_est(double nrows);
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 26,31 **** extern int compare_fractional_path_costs(Path *path1, Path *path2,
--- 26,32 ----
  							  double fraction);
  extern void set_cheapest(RelOptInfo *parent_rel);
  extern void add_path(RelOptInfo *parent_rel, Path *new_path);
+ extern void add_paths_from_list(RelOptInfo *parent_rel, List *pathlist);
  
  extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel);
  extern IndexPath *create_index_path(PlannerInfo *root,
*** a/src/test/regress/expected/rangefuncs.out
--- b/src/test/regress/expected/rangefuncs.out
***************
*** 1,16 ****
  SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
!        name        | setting 
! -------------------+---------
!  enable_bitmapscan | on
!  enable_hashagg    | on
!  enable_hashjoin   | on
!  enable_indexscan  | on
!  enable_mergejoin  | on
!  enable_nestloop   | on
!  enable_seqscan    | on
!  enable_sort       | on
!  enable_tidscan    | on
! (9 rows)
  
  CREATE TABLE foo2(fooid int, f2 int);
  INSERT INTO foo2 VALUES(1, 11);
--- 1,17 ----
  SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
!         name        | setting 
! --------------------+---------
!  enable_bitmapscan  | on
!  enable_hashagg     | on
!  enable_hashjoin    | on
!  enable_indexscan   | on
!  enable_joinremoval | on
!  enable_mergejoin   | on
!  enable_nestloop    | on
!  enable_seqscan     | on
!  enable_sort        | on
!  enable_tidscan     | on
! (10 rows)
  
  CREATE TABLE foo2(fooid int, f2 int);
  INSERT INTO foo2 VALUES(1, 11);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] join removal

Reply via email to