Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

Marti Raudsepp Sun, 26 Oct 2014 05:57:30 -0700

On Wed, Oct 22, 2014 at 1:37 PM, David Rowley <dgrowle...@gmail.com> wrote:
> I've had a bit of a look at this and here's a couple of things:


Thanks. Sorry I didn't back to you earlier, I almost forgot about the review.

>         /*
> +        * LIMIT clause can be removed if it's a positive constant or ALL, to
> +        * prevent it from being an optimization barrier. It's a common meme 
> to put
> +        * LIMIT 1 within EXISTS subqueries.
> +        */
>
> I think this comment may be better explained along the lines of:
>
> "A subquery which has a LIMIT clause with a positive value is effectively a
> no-op in this scenario. With EXISTS we only care about the first row anyway,
> so any positive limit value will have no behavioral change to the query, so
> we'll simply remove the LIMIT clause. If we're unable to prove that the
> LIMIT value is a positive number then we'd better not touch it."

That's a bit redundant. Here's what I wrote instead, does this sound better?

* A subquery that has a LIMIT clause with a positive value or NULL causes
* no behavioral change to the query. With EXISTS we only care about the
* first row anyway, so we'll simply remove the LIMIT clause. If the LIMIT
* value does not reduce to a constant that's positive or NULL then do not
* touch it.

> + /* Checking for negative values is done later; 0 is just silly */
> + if (! limit->constisnull && DatumGetInt64(limit->constvalue) <= 0)
>
> I'm a bit confused by the comment here. You're saying that we'll check for
> negatives later... but you're checking for <= 0 on the next line... Did I
> miss something or is this a mistake?

I meant that the case when a negative value raises an error is checked
later in the execution pass. But you're right, this code has nothing
to do with that so I've removed the mention about negative values. It
now says:

/* If it's not NULL and not positive, keep it as a regular subquery */

> I guess here you're just testing to ensure that you've not broken this and
> that the new code does not accidentally remove the LIMIT clause. I think it
> also might be worth adding some tests to ensure that co-related EXISTS
> clauses with a positive LIMIT value get properly changed into a SEMI JOIN
> instead of remaining as a subplan. And also make sure that a LIMIT 0 or
> LIMIT -1 gets left as a subplan. I'd say such a test would supersede the
> above test, and I think it's also the case you were talking about optimising
> anyway?

Sure, this is a planner-only change so it makes sense to test EXPLAIN output.

The dilemma I always have is: wouldn't this cause failures due to plan
instability, if for example autoanalyze gets around to this table
earlier or later and nudges it from a hash join to merge etc? Or is
this not a problem?

Patch attached with all above changes.

Regards,
Marti

From 28543dda9febe8d8b5fc91060a4323c08f3c4a8a Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <ma...@juffo.org>
Date: Wed, 1 Oct 2014 02:17:21 +0300
Subject: [PATCH] Simplify EXISTS subqueries containing LIMIT

---
 src/backend/optimizer/plan/subselect.c  | 42 ++++++++++++++++++++++-----
 src/test/regress/expected/subselect.out | 51 +++++++++++++++++++++++++++++++++
 src/test/regress/sql/subselect.sql      | 14 +++++++++
 3 files changed, 100 insertions(+), 7 deletions(-)

diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..66d1b90 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -70,7 +70,7 @@ static Node *convert_testexpr_mutator(Node *node,
 static bool subplan_is_hashable(Plan *plan);
 static bool testexpr_is_hashable(Node *testexpr);
 static bool hash_ok_operator(OpExpr *expr);
-static bool simplify_EXISTS_query(Query *query);
+static bool simplify_EXISTS_query(PlannerInfo *root, Query *query);
 static Query *convert_EXISTS_to_ANY(PlannerInfo *root, Query *subselect,
 					  Node **testexpr, List **paramIds);
 static Node *replace_correlation_vars_mutator(Node *node, PlannerInfo *root);
@@ -452,7 +452,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
 	 * If it's an EXISTS subplan, we might be able to simplify it.
 	 */
 	if (subLinkType == EXISTS_SUBLINK)
-		simple_exists = simplify_EXISTS_query(subquery);
+		simple_exists = simplify_EXISTS_query(root, subquery);
 
 	/*
 	 * For an EXISTS subplan, tell lower-level planner to expect that only the
@@ -518,7 +518,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
 		/* Make a second copy of the original subquery */
 		subquery = (Query *) copyObject(orig_subquery);
 		/* and re-simplify */
-		simple_exists = simplify_EXISTS_query(subquery);
+		simple_exists = simplify_EXISTS_query(root, subquery);
 		Assert(simple_exists);
 		/* See if it can be converted to an ANY query */
 		subquery = convert_EXISTS_to_ANY(root, subquery,
@@ -1359,7 +1359,7 @@ convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * targetlist, we have to fail, because the pullup operation leaves us
 	 * with noplace to evaluate the targetlist.
 	 */
-	if (!simplify_EXISTS_query(subselect))
+	if (!simplify_EXISTS_query(root, subselect))
 		return NULL;
 
 	/*
@@ -1486,11 +1486,11 @@ convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
  * Returns TRUE if was able to discard the targetlist, else FALSE.
  */
 static bool
-simplify_EXISTS_query(Query *query)
+simplify_EXISTS_query(PlannerInfo *root, Query *query)
 {
 	/*
 	 * We don't try to simplify at all if the query uses set operations,
-	 * aggregates, modifying CTEs, HAVING, LIMIT/OFFSET, or FOR UPDATE/SHARE;
+	 * aggregates, modifying CTEs, HAVING, OFFSET, or FOR UPDATE/SHARE;
 	 * none of these seem likely in normal usage and their possible effects
 	 * are complex.
 	 */
@@ -1501,7 +1501,6 @@ simplify_EXISTS_query(Query *query)
 		query->hasModifyingCTE ||
 		query->havingQual ||
 		query->limitOffset ||
-		query->limitCount ||
 		query->rowMarks)
 		return false;
 
@@ -1513,6 +1512,35 @@ simplify_EXISTS_query(Query *query)
 		return false;
 
 	/*
+	 * A subquery that has a LIMIT clause with a positive value or NULL causes
+	 * no behavioral change to the query. With EXISTS we only care about the
+	 * first row anyway, so we'll simply remove the LIMIT clause. If the LIMIT
+	 * value does not reduce to a constant that's positive or NULL then do not
+	 * touch it.
+	 */
+	if (query->limitCount)
+	{
+		/*
+		 * eval_const_expressions has not been called yet by subquery_planner,
+		 * may still contain int64 coercions etc.
+		 */
+		Node	   *node = eval_const_expressions(root, query->limitCount);
+		Const	   *limit;
+
+		if (! IsA(node, Const))
+			return false;
+
+		limit = (Const *) node;
+		Assert(limit->consttype == INT8OID);
+
+		/* If it's not NULL and not positive, keep it as a regular subquery */
+		if (!limit->constisnull && DatumGetInt64(limit->constvalue) <= 0)
+			return false;
+
+		query->limitCount = NULL;
+	}
+
+	/*
 	 * Otherwise, we can throw away the targetlist, as well as any GROUP,
 	 * WINDOW, DISTINCT, and ORDER BY clauses; none of those clauses will
 	 * change a nonzero-rows result to zero rows or vice versa.  (Furthermore,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 01c9130..e3b871b 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -775,6 +775,57 @@ select * from int4_tbl o where (f1, f1) in
 (1 row)
 
 --
+-- Check EXISTS simplification with LIMIT
+--
+explain (costs off)
+select * from int4_tbl o where exists
+  (select 1 from int4_tbl i where i.f1=o.f1 limit null);
+             QUERY PLAN             
+------------------------------------
+ Hash Semi Join
+   Hash Cond: (o.f1 = i.f1)
+   ->  Seq Scan on int4_tbl o
+   ->  Hash
+         ->  Seq Scan on int4_tbl i
+(5 rows)
+
+explain (costs off)
+select * from int4_tbl o where not exists
+  (select 1 from int4_tbl i where i.f1=o.f1 limit 1);
+             QUERY PLAN             
+------------------------------------
+ Hash Anti Join
+   Hash Cond: (o.f1 = i.f1)
+   ->  Seq Scan on int4_tbl o
+   ->  Hash
+         ->  Seq Scan on int4_tbl i
+(5 rows)
+
+explain (costs off)
+select * from int4_tbl o where exists (select 1 limit -1);
+          QUERY PLAN          
+------------------------------
+ Result
+   One-Time Filter: $0
+   InitPlan 1 (returns $0)
+     ->  Limit
+           ->  Result
+   ->  Seq Scan on int4_tbl o
+(6 rows)
+
+explain (costs off)
+select * from int4_tbl o where exists (select 1 limit 0);
+          QUERY PLAN          
+------------------------------
+ Result
+   One-Time Filter: $0
+   InitPlan 1 (returns $0)
+     ->  Limit
+           ->  Result
+   ->  Seq Scan on int4_tbl o
+(6 rows)
+
+--
 -- check for over-optimization of whole-row Var referencing an Append plan
 --
 select (select q from
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 56707e2..3f04137 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -433,6 +433,20 @@ select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
 
 --
+-- Check EXISTS simplification with LIMIT
+--
+explain (costs off)
+select * from int4_tbl o where exists
+  (select 1 from int4_tbl i where i.f1=o.f1 limit null);
+explain (costs off)
+select * from int4_tbl o where not exists
+  (select 1 from int4_tbl i where i.f1=o.f1 limit 1);
+explain (costs off)
+select * from int4_tbl o where exists (select 1 limit -1);
+explain (costs off)
+select * from int4_tbl o where exists (select 1 limit 0);
+
+--
 -- check for over-optimization of whole-row Var referencing an Append plan
 --
 select (select q from
-- 
2.1.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

Reply via email to