Re: [HACKERS] Performance improvement for joins where outer side is unique

Kyotaro HORIGUCHI Tue, 24 Mar 2015 05:12:43 -0700

Hi, thanks for the new patch.

I made an additional shrink from your last one. Do you have a
look on the attached?


At Sun, 22 Mar 2015 19:42:21 +1300, David Rowley <dgrowle...@gmail.com> wrote 
in <caaphdvrkwmmtwkxfn4uazyza9jql1c7uwbjbtuwfr69rqlv...@mail.gmail.com>
> On 20 March 2015 at 21:11, David Rowley <dgrowle...@gmail.com> wrote:
> >
> > I can continue working on your patch if you like? Or are you planning to
> > go further with it?

It's fine that you continue to work on this.

# Sorry for the hardly baked patch which had left many things alone:(

> I've been working on this more over the weekend and I've re-factored things
> to allow LEFT JOINs to be properly marked as unique.
> I've also made changes to re-add support for detecting the uniqueness of
> sub-queries.

I don't see the point of calling mark_unique_joins for every
iteration on join_info_list in remove_useless_joins. 

The loop already iteraltes on whole the join_info_list so
mark_unique_join as an individual function is needless.

Finally, simply marking uniqueness of join in join_is_removable
seems to be enough, inhibiting early bailing out by the checks on
attribute usage and placeholder let it work as expected.

Reducing changes to this extent, I can easily see what is added
to planning computations. It consists of mainly two factors.

- Unique-join chekcs for every candidate inner joins in
  add_paths_to_joinrel.

- Uniqueness check of mergejoinable clause in join-removability
  check for every left join, some of which would be skipped
  by other checks before.


> Also, I've added modified the costing for hash and nested loop joins to
> reduce the cost for unique inner joins to cost the join the same as it does
> for SEMI joins. This has tipped the scales on a few plans in the regression
> tests.

I've forgotten it, but quite important.

> Also, please see attached unijoin_analysis.patch. This just adds some code
> which spouts out notices when join nodes are initialised which states if
> the join is unique or not. Running the regression tests with this patch in
> places gives:
> 
> Unique Inner: Yes == 753 hits
> Unique Inner: No == 1430 hits
> 
> So it seems we can increase the speed of about 1 third of joins by about
> 10%.
> A quick scan of the "No"s seems to show quite a few cases which do not look
> that real world like. e.g cartesian join.

I don't have an idea how many queries in the reality hit this but
I suppose it's not a few.

> It would be great if someone could run some PostgreSQL application with
> these 2 patches applied, and then grep the logs for the Unique Inner
> results... Just to get a better idea of how many joins in a real world case
> will benefit from this patch.

Wow. I think the second patch should be DEBUGx, not NOTICE:)


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 659ae5c80ea9e6dc9f6f0a16a43db66b7329a2be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Tue, 24 Mar 2015 20:59:48 +0900
Subject: [PATCH] Removing individual phase to mark unique joins.

mark_unique_joins() makes redundant loops with remove_useless_joins()
over join_info_list. So removed the function and moved the core steps
into join_is_removable().
---
 src/backend/optimizer/path/joinpath.c     |   3 +-
 src/backend/optimizer/plan/analyzejoins.c | 247 +++++++++++-------------------
 src/backend/optimizer/plan/planmain.c     |   3 -
 src/include/nodes/relation.h              |   1 +
 src/include/optimizer/planmain.h          |   1 -
 5 files changed, 96 insertions(+), 159 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index b0bc3f6..909b585 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -296,7 +296,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 	}
 
 	/*
-	 * left joins were already checked for uniqueness in analyzejoins.c
+	 * left joins might be already checked for uniqueness in
+	 * remove_useless_joins()
 	 */
 
 	if (sjinfo->is_unique_join)
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 5d0cd2e..990f3b3 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -33,28 +33,12 @@
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
-static bool specialjoin_is_unique_join(PlannerInfo *root,
-					  SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static Oid	distinct_col_search(int colno, List *colnos, List *opids);
 
 
-void
-mark_unique_joins(PlannerInfo *root, List *joinlist)
-{
-	ListCell   *lc;
-
-	foreach(lc, root->join_info_list)
-	{
-		SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) lfirst(lc);
-
-		if (specialjoin_is_unique_join(root, sjinfo))
-			sjinfo->is_unique_join = true;
-	}
-}
-
 /*
  * remove_useless_joins
  *		Check for relations that don't actually need to be joined at all,
@@ -107,12 +91,6 @@ restart:
 		root->join_info_list = list_delete_ptr(root->join_info_list, sjinfo);
 
 		/*
-		 * We may now be able to mark some joins as unique which we could
-		 * not do before
-		 */
-		mark_unique_joins(root, joinlist);
-
-		/*
 		 * Restart the scan.  This is necessary to ensure we find all
 		 * removable joins independently of ordering of the join_info_list
 		 * (note that removal of attr_needed bits may make a join appear
@@ -173,17 +151,18 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery = NULL;
 	Relids		joinrelids;
-	int			attroff;
+	List	   *clause_list = NIL;
 	ListCell   *l;
+	int			attroff;
+	bool		removable = true;
 
 	/*
-	 * Join must not duplicate its outer side and must be a non-delaying left
-	 * join to a single baserel, else we aren't going to be able to do anything
-	 * with it.
+	 * Must be a non-delaying left join to a single baserel, else we aren't
+	 * going to be able to do anything with it.
 	 */
-	if (!sjinfo->is_unique_join ||
-		sjinfo->jointype != JOIN_LEFT ||
+	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins)
 		return false;
 
@@ -192,98 +171,38 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 	innerrel = find_base_rel(root, innerrelid);
 
-	Assert(innerrel->reloptkind == RELOPT_BASEREL);
-
-	/* Compute the relid set for the join we are considering */
-	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
+	if (innerrel->reloptkind != RELOPT_BASEREL)
+		return false;
 
 	/*
-	 * We can't remove the join if any inner-rel attributes are used above the
-	 * join.
-	 *
-	 * Note that this test only detects use of inner-rel attributes in higher
-	 * join conditions and the target list.  There might be such attributes in
-	 * pushed-down conditions at this join, too, but in this case the join
-	 * would not have been marked as unique.
-	 *
-	 * As a micro-optimization, it seems better to start with max_attr and
-	 * count down rather than starting with min_attr and counting up, on the
-	 * theory that the system attributes are somewhat less likely to be wanted
-	 * and should be tested last.
+	 * Before we go to the effort of checking whether any innerrel variables
+	 * are needed above the join, make a quick check to eliminate cases in
+	 * which we will surely be unable to prove uniqueness of the innerrel.
 	 */
-	for (attroff = innerrel->max_attr - innerrel->min_attr;
-		 attroff >= 0;
-		 attroff--)
+	if (innerrel->rtekind == RTE_RELATION)
 	{
-		if (!bms_is_subset(innerrel->attr_needed[attroff], joinrelids))
+		/*
+		 * For a plain-relation innerrel, we only know how to prove uniqueness
+		 * by reference to unique indexes.  If there are no indexes then
+		 * there's certainly no unique indexes so there's no point in going
+		 * further.
+		 */
+		if (innerrel->indexlist == NIL)
 			return false;
 	}
-
-	/*
-	 * Similarly check that the inner rel isn't needed by any PlaceHolderVars
-	 * that will be used above the join.  We only need to fail if such a PHV
-	 * actually references some inner-rel attributes; but the correct check
-	 * for that is relatively expensive, so we first check against ph_eval_at,
-	 * which must mention the inner rel if the PHV uses any inner-rel attrs as
-	 * non-lateral references.  Note that if the PHV's syntactic scope is just
-	 * the inner rel, we can't drop the rel even if the PHV is variable-free.
-	 */
-	foreach(l, root->placeholder_list)
+	else if (innerrel->rtekind == RTE_SUBQUERY)
 	{
-		PlaceHolderInfo *phinfo = (PlaceHolderInfo *) lfirst(l);
+		subquery = root->simple_rte_array[innerrelid]->subquery;
 
-		if (bms_is_subset(phinfo->ph_needed, joinrelids))
-			continue;			/* PHV is not used above the join */
-		if (bms_overlap(phinfo->ph_lateral, innerrel->relids))
-			return false;		/* it references innerrel laterally */
-		if (!bms_overlap(phinfo->ph_eval_at, innerrel->relids))
-			continue;			/* it definitely doesn't reference innerrel */
-		if (bms_is_subset(phinfo->ph_eval_at, innerrel->relids))
-			return false;		/* there isn't any other place to eval PHV */
-		if (bms_overlap(pull_varnos((Node *) phinfo->ph_var->phexpr),
-						innerrel->relids))
-			return false;		/* it does reference innerrel */
+		/*
+		 * If the subquery has no qualities that support distinctness proofs
+		 * then there's no point in going further.
+		 */
+		if (!query_supports_distinctness(subquery))
+			return false;
 	}
-
-	return true;
-}
-
-/*
- * specialjoin_is_unique_join
- *		True if it can be proved that this special join will never produce
- *		more than 1 row per outer row, otherwise returns false if there is
- *		insufficient evidence to prove the join is unique.
- */
-static bool
-specialjoin_is_unique_join(PlannerInfo *root, SpecialJoinInfo *sjinfo)
-{
-	int			innerrelid;
-	RelOptInfo *innerrel;
-	Query	   *subquery = NULL;
-	Relids		joinrelids;
-	ListCell   *l;
-	List	   *clause_list = NIL;
-
-	/* check if we've already marked this join as unique on a previous call */
-	if (sjinfo->is_unique_join)
-		return true;
-
-	/* if there's more than 1 relation involved then punt */
-	if (!bms_get_singleton_member(sjinfo->min_righthand, &innerrelid))
-		return false;
-
-	innerrel = find_base_rel(root, innerrelid);
-
-	if (innerrel->reloptkind != RELOPT_BASEREL)
-		return false;
-
-	/*
-	 * Before we go to the effort of pulling out the join condition's columns,
-	 * make a quick check to eliminate cases in which we will surely be unable
-	 * to prove uniqueness of the innerrel.
-	 */
-	if (!rel_supports_distinctness(root, innerrel))
-		return false;
+	else
+		return false;			/* unsupported rtekind */
 
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
@@ -293,7 +212,9 @@ specialjoin_is_unique_join(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 * either the outer rel or a pseudoconstant.  If an operator is
 	 * mergejoinable then it behaves like equality for some btree opclass, so
 	 * it's what we want.  The mergejoinability test also eliminates clauses
-	 * containing volatile functions, which we couldn't depend on.
+	 * containing volatile functions, which we couldn't depend on.  The
+	 * uniquness might differ from the previous call on the same innerrel so
+	 * always check it regardless of the previous result.
 	 */
 	foreach(l, innerrel->joininfo)
 	{
@@ -309,8 +230,10 @@ specialjoin_is_unique_join(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 			!bms_equal(restrictinfo->required_relids, joinrelids))
 		{
 			/*
-			 * If such a clause actually references the inner rel then we can't
-			 * mark the join as unique.
+			 * If such a clause actually references the inner rel then join
+			 * removal has to be disallowed.  We have to check this despite
+			 * the previous attr_needed checks because of the possibility of
+			 * pushed-down clauses referencing the rel.
 			 */
 			if (bms_is_member(innerrelid, restrictinfo->clause_relids))
 				return false;
@@ -333,10 +256,63 @@ specialjoin_is_unique_join(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		clause_list = lappend(clause_list, restrictinfo);
 	}
 
-	if (rel_is_distinct_for(root, innerrel, clause_list))
-		return true;
+	/*
+	 * Some day it would be nice to check for other methods of establishing
+	 * distinctness.
+	 */
+	sjinfo->is_unique_join = rel_is_distinct_for(root, innerrel, clause_list);
 
-	return false;
+	if (!sjinfo->is_unique_join)
+		return false;
+
+	/*
+	 * We can't remove the join if any inner-rel attributes are used above the
+	 * join.
+	 *
+	 * Note that this test only detects use of inner-rel attributes in higher
+	 * join conditions and the target list.  There might be such attributes in
+	 * pushed-down conditions at this join, too.  We check that case below.
+	 *
+	 * As a micro-optimization, it seems better to start with max_attr and
+	 * count down rather than starting with min_attr and counting up, on the
+	 * theory that the system attributes are somewhat less likely to be wanted
+	 * and should be tested last.
+	 */
+	for (attroff = innerrel->max_attr - innerrel->min_attr;
+		 attroff >= 0 && removable;
+		 attroff--)
+	{
+		if (!bms_is_subset(innerrel->attr_needed[attroff], joinrelids))
+			return false;
+	}
+
+	/*
+	 * Similarly check that the inner rel isn't needed by any PlaceHolderVars
+	 * that will be used above the join.  We only need to fail if such a PHV
+	 * actually references some inner-rel attributes; but the correct check
+	 * for that is relatively expensive, so we first check against ph_eval_at,
+	 * which must mention the inner rel if the PHV uses any inner-rel attrs as
+	 * non-lateral references.  Note that if the PHV's syntactic scope is just
+	 * the inner rel, we can't drop the rel even if the PHV is variable-free.
+	 */
+	foreach(l, root->placeholder_list)
+	{
+		PlaceHolderInfo *phinfo = (PlaceHolderInfo *) lfirst(l);
+
+		if (bms_is_subset(phinfo->ph_needed, joinrelids))
+			continue;			/* PHV is not used above the join */
+		if (bms_overlap(phinfo->ph_lateral, innerrel->relids))
+			return false;		/* it references innerrel laterally */
+		if (!bms_overlap(phinfo->ph_eval_at, innerrel->relids))
+			continue;			/* it definitely doesn't reference innerrel */
+		if (bms_is_subset(phinfo->ph_eval_at, innerrel->relids))
+			return false;		/* there isn't any other place to eval PHV */
+		if (bms_overlap(pull_varnos((Node *) phinfo->ph_var->phexpr),
+						innerrel->relids))
+			return false;		/* it does reference innerrel */
+	}
+
+	return true;
 }
 
 
@@ -617,43 +593,6 @@ rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List *clause_list)
 	}
 	return false; /* can't prove rel to be distinct over clause_list */
 }
-/*
- * rel_supports_distinctness
- *		Returns true if rel has some properties which can prove the relation
- *		to be unique over some set of columns.
- *
- * This is effectively a pre-checking function for rel_is_distinct_for().
- * It must return TRUE if rel_is_distinct_for() could possibly return TRUE
- */
-bool
-rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel)
-{
-	if (rel->rtekind == RTE_RELATION)
-	{
-		/*
-		 * For a plain-relation, we only know how to prove uniqueness
-		 * by reference to unique indexes.  If there are no indexes then
-		 * there's certainly no unique indexes so there's nothing to prove
-		 * uniqueness on the relation.
-		 */
-		if (rel->indexlist != NIL)
-			return true;
-	}
-	else if (rel->rtekind == RTE_SUBQUERY)
-	{
-		Query *subquery = root->simple_rte_array[rel->relid]->subquery;
-
-		/* Check if the subquery has any qualities that support distinctness */
-		if (query_supports_distinctness(subquery))
-			return true;
-	}
-
-	/*
-	 * Some day it would be nice to check for other methods of establishing
-	 * distinctness.
-	 */
-	return false;
-}
 
 /*
  * query_supports_distinctness - could the query possibly be proven distinct
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 55310d8..848df97 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -174,9 +174,6 @@ query_planner(PlannerInfo *root, List *tlist,
 	 */
 	fix_placeholder_input_needed_levels(root);
 
-	/* Analyze joins to find out which ones have a unique inner side */
-	mark_unique_joins(root, joinlist);
-
 	/*
 	 * Remove any useless outer joins.  Ideally this would be done during
 	 * jointree preprocessing, but the necessary information isn't available
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 776a269..b7cecb6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1407,6 +1407,7 @@ typedef struct SpecialJoinInfo
 	JoinType	jointype;		/* always INNER, LEFT, FULL, SEMI, or ANTI */
 	bool		lhs_strict;		/* joinclause is strict for some LHS rel */
 	bool		delay_upper_joins;		/* can't commute with upper RHS */
+	List	   *mergejoinable_clauses; /* cache of mergejoinable clause */
 	bool		is_unique_join; /* matches a max of 1 row per outer join row */
 	/* Remaining fields are set only for JOIN_SEMI jointype: */
 	bool		semi_can_btree; /* true if semi_operators are all btree */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 7a85227..bd1a709 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -126,7 +126,6 @@ extern void mark_unique_joins(PlannerInfo *root, List *joinlist);
 extern List *remove_useless_joins(PlannerInfo *root, List *joinlist);
 extern bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
 								List *clause_list);
-extern bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 extern bool query_supports_distinctness(Query *query);
 extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 
-- 
2.1.0.GIT

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance improvement for joins where outer side is unique

Reply via email to