Avoid unnecessary processing of DISTINCT clause (Was: Unique Keys)

Antonin Houska Wed, 19 Nov 2025 10:42:46 -0800

[1] uses bottom-up approach to check if scan/join path already happens to
produce distinct rows, in order to avoid the DISTINCT step of planning.  It
does so by conveying the uniqueness information ("unique keys") from
individual table/index scans to the upper planner. The concept is similar to
"path keys", which describe ordering of the path output. However, unlike "path
keys", the "unique keys" do not participate (AFAICS) in planning decisions in
the scan/join planning.


Here I try to introduce a top-down approach and deduce the path distinctness
from the existing information in the path tree. As the patch does not add any
information that the planner would have to propagate from lower to upper
nodes, it's less invasive than [1]. (Unlike [1], I haven't implemented the
"single row optimization" yet, but I see no reason why it shouldn't be
possible.)

For index / scan path, the patch obviously uses unique indexes as the "source
of uniqueness". For joins, it relies on the JoinPath.inner_unique field. The
theory is that a join produces unique rows if the 1) outer path does and 2) no
more than one inner row matches each outer row (i.e. the inner path does not
"duplicate" the outer rows). For more details, please see the commit message
and the patch itself.

Finally it's checked whether the final scan/join output would be unique even
if it contained only the DISTINCT expressions. If it does, no additional
processing (such as UniquePath or AggPath) is needed.

Is anything wrong about this approach, whether conceptually or in details?

[1] https://www.postgresql.org/message-id/7mlamswjp81p.fsf%40e18c07352.et15sqa

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 7715f28ade15976e6da12adba3e99a0ec016cd99 Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Wed, 19 Nov 2025 10:58:28 +0100
Subject: [PATCH] Avoid DISTINCT step if the input path already generates
 distinct rows.

This patch teaches the planner to recognize if the output of the scan/join
path happens to satisfy the DISTINCT clause of the query. If it does, paths
like UniquePath or AggPath do not have to be applied to it.

The problem can be addressed by adding uniqueness information ("unique keys")
to the table / index scan paths and propagate it to the upper paths
(bottom-up), like we do with ordering information ("path keys"). However it's
not quite clear how those "unique keys" should participate in planning
decisions, which are already non-trivial. Therefore it doesn't seem worth
adding extra information to Path and subclasses.

This patch handles the problem the other way round (top-down): the scan/join
path at the top of the tree is checked, and if it has subpaths, the check
includes recursion into these.

* Index path is considered distinct if the index is unique and if the index
  keys are a subset of the DISTINCT expressions referencing the index's parent
  table. No DISTINCT expression may reference a nullable column.

* Sequential scan path is considered distinct if the underlying table has at
  least one index for which a distinct index path can be created (see
  above). No DISTINCT expression may reference a nullable column.

* Unique path is considered distinct if its keys are a subset of the DISTINCT
  expressions referencing the input path.

* Join path is considered distinct if 1) the expressions belonging to the
  outer path form unique rows and 2) there is no more than one inner row per
  output row.

* Paths that do not affect uniqueness are considered distinct if their input
  paths are so.

The existing planner infrastructure is used, i.e. no fields are added to Path
and subclasses.
---
 src/backend/optimizer/path/indxpath.c     | 242 +++++++++++++++-------
 src/backend/optimizer/plan/analyzejoins.c |  19 ++
 src/backend/optimizer/plan/planner.c      | 241 ++++++++++++++++++++-
 src/include/optimizer/paths.h             |   8 +
 src/test/regress/expected/distinct.out    | 117 +++++++++++
 src/test/regress/parallel_schedule        |   2 +
 src/test/regress/sql/distinct.sql         |  54 +++++
 7 files changed, 607 insertions(+), 76 deletions(-)
 create mode 100644 src/test/regress/expected/distinct.out
 create mode 100644 src/test/regress/sql/distinct.sql

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index c62e3f87724..86d2106c70b 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -28,6 +28,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
+#include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
@@ -4137,11 +4138,29 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 							  List *restrictlist, List **extra_clauses)
 {
 	ListCell   *ic;
+	List	*exprs = NIL;
 
 	/* Short-circuit if no indexes... */
 	if (rel->indexlist == NIL)
 		return false;
 
+	/*
+	 * Retrieve expressions from the join clauses. ->outer_is_left should
+	 * already be set for these.
+	 */
+	foreach(ic, restrictlist)
+	{
+		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(ic);
+		Node	*expr;
+
+		/* Pick the expression we'll try to match to index. */
+		if (restrictinfo->outer_is_left)
+			expr = get_rightop(restrictinfo->clause);
+		else
+			expr = get_leftop(restrictinfo->clause);
+		exprs = lappend(exprs, expr);
+	}
+
 	/*
 	 * Examine the rel's restriction clauses for usable var = const clauses
 	 * that we can add to the restrictlist.
@@ -4149,6 +4168,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	foreach(ic, rel->baserestrictinfo)
 	{
 		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(ic);
+		Node	   *expr = NULL;
 
 		/*
 		 * Note: can_join won't be set for a restriction clause, but
@@ -4165,18 +4185,28 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 		if (bms_is_empty(restrictinfo->left_relids))
 		{
 			/* righthand side is inner */
-			restrictinfo->outer_is_left = true;
+			expr = get_rightop(restrictinfo->clause);
 		}
 		else if (bms_is_empty(restrictinfo->right_relids))
 		{
 			/* lefthand side is inner */
-			restrictinfo->outer_is_left = false;
+			expr = get_leftop(restrictinfo->clause);
 		}
 		else
 			continue;
 
-		/* OK, add to list */
+		/*
+		 * Do not use non-strict clauses - those would allow for two inner
+		 * rows to match a single outer row: one with NULL in the join column
+		 * and one with NOT NULL.
+		 */
+		if (contain_nonstrict_functions((Node *) restrictinfo->clause))
+			continue;
+
+		/* Add the clause to the list. */
 		restrictlist = lappend(restrictlist, restrictinfo);
+		/* This is the expression that we'll try to match to the indexes. */
+		exprs = lappend(exprs, expr);
 	}
 
 	/* Short-circuit the easy case */
@@ -4187,91 +4217,163 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	foreach(ic, rel->indexlist)
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
-		int			c;
-		List	   *exprs = NIL;
 
-		/*
-		 * If the index is not unique, or not immediately enforced, or if it's
-		 * a partial index, it's useless here.  We're unable to make use of
-		 * predOK partial unique indexes due to the fact that
-		 * check_index_predicates() also makes use of join predicates to
-		 * determine if the partial index is usable. Here we need proofs that
-		 * hold true before any joins are evaluated.
-		 */
-		if (!ind->unique || !ind->immediate || ind->indpred != NIL)
-			continue;
+		if (index_is_unique_for_exprs(root, ind, exprs, restrictlist,
+									  extra_clauses))
+			return true;
+	}
+	return false;
+}
 
-		/*
-		 * Try to find each index column in the list of conditions.  This is
-		 * O(N^2) or worse, but we expect all the lists to be short.
-		 */
-		for (c = 0; c < ind->nkeycolumns; c++)
+/*
+ * relation_has_unique_index_for_exprs
+ *	  Like relation_has_unique_index_for() but receives the list of
+ *	  expressions to match against the indexes.
+ *
+ * 'restrictlist' is needed when the expressions come from join clauses - in
+ * that case i-th item of 'restrictlist' corresponds to i-th element of
+ * 'exprs'. We need it to check operator families of those clauses.
+ */
+bool
+relation_has_unique_index_for_exprs(PlannerInfo *root, RelOptInfo *rel,
+									List *exprs, List *restrictlist,
+									List **extra_clauses)
+{
+	ListCell   *ic;
+
+	/* Short-circuit if no indexes... */
+	if (rel->indexlist == NIL)
+		return false;
+
+	/* Examine each index of the relation ... */
+	foreach(ic, rel->indexlist)
+	{
+		IndexOptInfo *ind = lfirst_node(IndexOptInfo, ic);
+
+		if (index_is_unique_for_exprs(root, ind, exprs, restrictlist,
+									  extra_clauses))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Does scan of this index generate rows such that each row as a distinct
+ * values of expressions 'exprs'?
+ *
+ * We check if the index is unique and if its target expressions are a subset
+ * of 'exprs'. For example, if 'exprs' is {x, y} and the index key is 'x', the
+ * rows {x, y} must be distinct because each has a distinct value of 'x'. In
+ * contrast, if the index keys are {x, y, z}, the index scan can emit many
+ * identical rows {x, y} - each with a different value of 'z'. (Obviously, if
+ * the index keys are {x, z}, many identical rows {x, y} can be emitted
+ * because 'y' does not restrict the index search at all. Thus for given value
+ * of 'y', multiple rows of the index can have the same value of 'x' - each
+ * with different value of 'z'.)
+ *
+ * 'root' is only needed if 'extra_clauses_p' is a valid pointer.
+ *
+ * See relation_has_unique_index_for_exprs() for comments on 'restrictlist'.
+ */
+bool
+index_is_unique_for_exprs(PlannerInfo *root, IndexOptInfo *index, List *exprs,
+						  List *restrictlist, List **extra_clauses_p)
+{
+	int			c;
+	List	   *extra_clauses = NIL;
+
+	/*
+	 * If the index is not unique, or not immediately enforced, or if it's a
+	 * partial index, it's useless here.  We're unable to make use of predOK
+	 * partial unique indexes due to the fact that check_index_predicates()
+	 * also makes use of join predicates to determine if the partial index is
+	 * usable. Here we need proofs that hold true before any joins are
+	 * evaluated.
+	 */
+	if (!index->unique || !index->immediate || index->indpred != NIL)
+		return false;
+
+	/* No RestrictInfo's or one per expression. */
+	Assert(restrictlist == NIL ||
+		   list_length(exprs) == list_length(restrictlist));
+
+	/*
+	 * Try to find each index column in the list of expressions. This is
+	 * O(N^2) or worse, but we expect all the lists to be short.
+	 */
+	for (c = 0; c < index->nkeycolumns; c++)
+	{
+		ListCell   *lce, *lcri = NULL;
+
+		if (restrictlist)
+			lcri = list_head(restrictlist);
+
+		foreach(lce, exprs)
 		{
-			ListCell   *lc;
+			Node	   *expr = (Node *) lfirst(lce);
+			RestrictInfo	*rinfo = NULL;
 
-			foreach(lc, restrictlist)
+			/*
+			 * The condition's equality operator must be a member of the index
+			 * opfamily, else it is not asserting the right kind of equality
+			 * behavior for this index.  We check this first since it's
+			 * probably cheaper than match_index_to_operand().
+			 */
+			if (restrictlist)
 			{
-				RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
-				Node	   *rexpr;
+				rinfo = lfirst_node(RestrictInfo, lcri);
 
 				/*
-				 * The condition's equality operator must be a member of the
-				 * index opfamily, else it is not asserting the right kind of
-				 * equality behavior for this index.  We check this first
-				 * since it's probably cheaper than match_index_to_operand().
+				 * Now that we have 'rinfo', advance 'lcri' so that we're
+				 * ready for the next iteration.
 				 */
-				if (!list_member_oid(rinfo->mergeopfamilies, ind->opfamily[c]))
-					continue;
+				lcri = lnext(restrictlist, lcri);
 
-				/*
-				 * XXX at some point we may need to check collations here too.
-				 * For the moment we assume all collations reduce to the same
-				 * notion of equality.
-				 */
+				if (!list_member_oid(rinfo->mergeopfamilies,
+									 index->opfamily[c]))
+					continue;
+			}
 
-				/* OK, see if the condition operand matches the index key */
-				if (rinfo->outer_is_left)
-					rexpr = get_rightop(rinfo->clause);
-				else
-					rexpr = get_leftop(rinfo->clause);
+			/*
+			 * XXX at some point we may need to check collations here too.
+			 * For the moment we assume all collations reduce to the same
+			 * notion of equality.
+			 */
 
-				if (match_index_to_operand(rexpr, c, ind))
+			if (match_index_to_operand(expr, c, index))
+			{
+				if (extra_clauses_p &&
+					bms_membership(rinfo->clause_relids) == BMS_SINGLETON)
 				{
-					if (bms_membership(rinfo->clause_relids) == BMS_SINGLETON)
-					{
-						MemoryContext oldMemCtx =
-							MemoryContextSwitchTo(root->planner_cxt);
-
-						/*
-						 * Add filter clause into a list allowing caller to
-						 * know if uniqueness have made not only by join
-						 * clauses.
-						 */
-						Assert(bms_is_empty(rinfo->left_relids) ||
-							   bms_is_empty(rinfo->right_relids));
-						if (extra_clauses)
-							exprs = lappend(exprs, rinfo);
-						MemoryContextSwitchTo(oldMemCtx);
-					}
-
-					break;		/* found a match; column is unique */
+					MemoryContext oldMemCtx =
+						MemoryContextSwitchTo(root->planner_cxt);
+
+					/*
+					 * Add filter clause into a list allowing caller to know
+					 * if uniqueness have made not only by join clauses.
+					 *
+					 * Callers that pass 'extra_clauses_p' currently also pass
+					 * 'restrictlist'.
+					 */
+					Assert(bms_is_empty(rinfo->left_relids) ||
+						   bms_is_empty(rinfo->right_relids));
+					extra_clauses = lappend(extra_clauses, rinfo);
+					MemoryContextSwitchTo(oldMemCtx);
 				}
-			}
 
-			if (lc == NULL)
-				break;			/* no match; this index doesn't help us */
+				break;		/* found a match; column is unique */
+			}
 		}
 
-		/* Matched all key columns of this index? */
-		if (c == ind->nkeycolumns)
-		{
-			if (extra_clauses)
-				*extra_clauses = exprs;
-			return true;
-		}
+		if (lce == NULL)
+			return false;		/* no match; this index doesn't help us */
 	}
 
-	return false;
+	/* Matched all key columns of this index. */
+	if (extra_clauses_p)
+		*extra_clauses_p = extra_clauses;
+	return true;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index e592e1ac3d1..34fa6f28fe5 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -24,6 +24,7 @@
 
 #include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
@@ -980,6 +981,24 @@ static bool
 rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List *clause_list,
 					List **extra_clauses)
 {
+	List *clause_list_new = NIL;
+	ListCell	*lc;
+
+	/*
+	 * Do not use non-strict clauses - those would allow for two inner rows to
+	 * match a single outer row: one with NULL in the join column and one with
+	 * NOT NULL.
+	 */
+	foreach(lc, clause_list)
+	{
+		RestrictInfo	*ri = lfirst_node(RestrictInfo, lc);
+
+		if (contain_nonstrict_functions((Node *) ri->clause))
+			continue;
+		clause_list_new = lappend(clause_list_new, ri);
+	}
+	clause_list = clause_list_new;
+
 	/*
 	 * We could skip a couple of tests here if we assume all callers checked
 	 * rel_supports_distinctness first, but it doesn't seem worth taking any
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c4fd646b999..d85aedf56f4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -208,6 +208,7 @@ static RelOptInfo *create_final_distinct_paths(PlannerInfo *root,
 static List *get_useful_pathkeys_for_distinct(PlannerInfo *root,
 											  List *needed_pathkeys,
 											  List *path_pathkeys);
+static bool path_is_distinct_for(Path *path, List *exprs);
 static RelOptInfo *create_ordered_paths(PlannerInfo *root,
 										RelOptInfo *input_rel,
 										PathTarget *target,
@@ -5066,6 +5067,11 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 	Path	   *cheapest_input_path = input_rel->cheapest_total_path;
 	double		numDistinctRows;
 	bool		allow_hash;
+	List	   *distinctExprs;
+	ListCell	*lc;
+
+	distinctExprs = get_sortgrouplist_exprs(root->processed_distinctClause,
+											parse->targetList);
 
 	/* Estimate number of distinct rows there will be */
 	if (parse->groupClause || parse->groupingSets || parse->hasAggs ||
@@ -5083,10 +5089,6 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 		/*
 		 * Otherwise, the UNIQUE filter has effects comparable to GROUP BY.
 		 */
-		List	   *distinctExprs;
-
-		distinctExprs = get_sortgrouplist_exprs(root->processed_distinctClause,
-												parse->targetList);
 		numDistinctRows = estimate_num_groups(root, distinctExprs,
 											  cheapest_input_path->rows,
 											  NULL, NULL);
@@ -5099,7 +5101,7 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 	{
 		/*
 		 * Firstly, if we have any adequately-presorted paths, just stick a
-		 * Unique node on those.  We also, consider doing an explicit sort of
+		 * Unique node on those.  We also consider doing an explicit sort of
 		 * the cheapest input path and Unique'ing that.  If any paths have
 		 * presorted keys then we'll create an incremental sort atop of those
 		 * before adding a unique node on the top.  We'll also attempt to
@@ -5114,7 +5116,6 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 		 * the other.)
 		 */
 		List	   *needed_pathkeys;
-		ListCell   *lc;
 		double		limittuples = root->distinct_pathkeys == NIL ? 1.0 : -1.0;
 
 		if (parse->hasDistinctOn &&
@@ -5130,6 +5131,13 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 			Path	   *sorted_path;
 			List	   *useful_pathkeys_list = NIL;
 
+			/* Does the path already produce distinct rows? */
+			if (path_is_distinct_for(input_path, distinctExprs))
+			{
+				add_path(distinct_rel, input_path);
+				continue;
+			}
+
 			useful_pathkeys_list =
 				get_useful_pathkeys_for_distinct(root,
 												 needed_pathkeys,
@@ -5192,6 +5200,10 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 		}
 	}
 
+	/* Below we process cheapest_input_path, so check if it's necessary. */
+	if (path_is_distinct_for(cheapest_input_path, distinctExprs))
+		return distinct_rel;
+
 	/*
 	 * Consider hash-based implementations of DISTINCT, if possible.
 	 *
@@ -5306,6 +5318,223 @@ get_useful_pathkeys_for_distinct(PlannerInfo *root, List *needed_pathkeys,
 	return useful_pathkeys_list;
 }
 
+/*
+ * Does the path generate rows such that each row has a distinct values of
+ * expressions 'exprs'?
+ *
+ * For index path, we check if the index is unique and if it its key
+ * expressions are a subset of the DISTINCT expressions - see the header
+ * comment of index_is_unique_for_exprs() for details.
+ *
+ * For sequential scan, we check if at least one (unique) index that
+ * guarantees the distinctness exists - see
+ * relation_has_unique_index_for_exprs().
+ *
+ * For both sequential and index path we require that all attributes appearing
+ * in 'exprs' are NOT NULL.
+ *
+ * UniquePath is considered similar to the unique index: the output is
+ * considered distinct if the path keys are a subset of the expressions that
+ * form the output row.
+ *
+ * For joins, the theory is that the rows of 'exprs' are distinct if 1) the
+ * rows of expressions belonging to the outer path are unique and 2) there is
+ * at most one inner row per outer row, ie the inner path does not duplicate
+ * the outer rows. The latter assumes that all the join/restriction clauses
+ * are strict, otherwise two inner rows could match a single outer row: one
+ * with NULL in a join column and the other with NOT NULL.
+ *
+ * Other paths that we accept here do not affect the distinctness, so we only
+ * recurse into their input paths.
+ */
+static bool
+path_is_distinct_for(Path *path, List *exprs)
+{
+	ListCell	*lc;
+	Path	*subpath = NULL;
+
+	/* No distinct rows if the row is not defined. */
+	if (exprs == NIL)
+		return false;
+
+	/* Does any expression reference a nullable attribute? */
+	if (IsA(path, Path) || IsA(path, IndexPath))
+	{
+		Bitmapset	*attrs = NULL;
+		RelOptInfo	*rel = path->parent;
+		int		prev;
+
+		/*
+		 * If all attributes are nullable, it makes no sense to check the
+		 * expressions.
+		 */
+		if (bms_is_empty(rel->notnullattnums))
+			return false;
+
+		pull_varattnos((Node *) exprs, rel->relid, &attrs);
+		prev = -1;
+		while ((prev = bms_next_member(attrs, prev)) >= 0)
+		{
+			int		attnum = prev + FirstLowInvalidHeapAttributeNumber;
+
+			/*
+			 * A single nullable attribute is the reason to reject the path.
+			 */
+			if (!bms_is_member(attnum, rel->notnullattnums))
+				return false;
+		}
+	}
+
+	/* First, handle paths that do not affect distinctness. */
+	/*
+	 * TODO Consider other paths that can be created in the scan/join
+	 * planning.
+	 */
+	if (IsA(path, SortPath))
+	{
+		subpath = castNode(SortPath, path)->subpath;
+	}
+	else if (IsA(path, IncrementalSortPath))
+	{
+		subpath = castNode(IncrementalSortPath, path)->spath.subpath;
+	}
+	else if (IsA(path, ProjectionPath))
+	{
+		subpath = castNode(ProjectionPath, path)->subpath;
+	}
+	else if (IsA(path, GatherPath))
+	{
+		subpath = castNode(GatherPath, path)->subpath;
+	}
+	else if (IsA(path, GatherMergePath))
+	{
+		subpath = castNode(GatherMergePath, path)->subpath;
+	}
+	else if (IsA(path, MaterialPath))
+	{
+		subpath = castNode(MaterialPath, path)->subpath;
+	}
+	/* The following paths are handled below. */
+	else if (!(IsA(path, Path) || IsA(path, IndexPath) ||
+			   IsA(path, NestPath) || IsA(path, HashPath) ||
+			   IsA(path, MergePath) || IsA(path, UniquePath)))
+	{
+		/* Check not implemented. */
+		return false;
+	}
+
+	if (subpath)
+		return path_is_distinct_for(subpath, exprs);
+
+	if (IsA(path, Path))
+	{
+		if (relation_has_unique_index_for_exprs(NULL, path->parent, exprs,
+												NIL, NULL))
+			return true;
+	}
+	else if (IsA(path, IndexPath))
+	{
+		IndexPath	*ipath = castNode(IndexPath, path);
+
+		if (index_is_unique_for_exprs(NULL, ipath->indexinfo, exprs, NIL,
+									  NULL))
+			return true;
+	}
+	/*
+	 * For join to generate unique rows (containing only the expressions
+	 * passed in 'exprs'), two conditions need to be met: 1) the outer side
+	 * generates unique rows of the corresponding expressions, 2) outer tuple
+	 * matches no more than one inner tuple.
+	 */
+	else if (IsA(path, NestPath) || IsA(path, HashPath) ||
+			 IsA(path, MergePath))
+	{
+		JoinPath	*jpath = (JoinPath *) path;
+		List	*exprs_outer = NIL;
+
+		/* The join must not duplicate rows. */
+		if (!jpath->inner_unique)
+			return false;
+
+		/*
+		 * Retrieve expressions belonging to the outer path.
+		 */
+		foreach(lc, exprs)
+		{
+			Expr	*expr = (Expr *) lfirst(lc);
+			ListCell	*lc2;
+
+			foreach(lc2, jpath->outerjoinpath->pathtarget->exprs)
+			{
+				Expr	*texpr = (Expr *) lfirst(lc2);
+
+				if (equal(expr, texpr))
+				{
+					exprs_outer = lappend(exprs_outer, texpr);
+					break;
+				}
+			}
+		}
+
+		/*
+		 * Since ->inner_unique assumes that all the join/restrict clauses are
+		 * strict, we don't have to worry about NULL-extended rows. Therefore
+		 * we don't bother to check JoinType.
+		 */
+
+		/* Finally, check distinctness of the outer side. */
+		return path_is_distinct_for(jpath->outerjoinpath, exprs_outer);
+	}
+
+	if (IsA(path, UniquePath))
+	{
+		UniquePath	*upath = (UniquePath *) path;
+		int		keyno = 0;
+
+		/*
+		 * Check that the set of unique expressions generated by the path is a
+		 * subset of 'exprs', otherwise values of 'exprs' could repeat.
+		 */
+		foreach(lc, upath->path.pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *ec = pathkey->pk_eclass;
+			ListCell	*lc2;
+
+			if (keyno >= upath->numkeys)
+				break;
+
+			/*
+			 * TODO Can we do more? According to make_unique_from_pathkeys(),
+			 * the EC is specific to one particular targetlist entry, however
+			 * we can't easily check where the items of 'exprs' come from. At
+			 * least elaborate this comment.
+			 */
+			if (ec->ec_has_volatile)
+				return false;
+
+			foreach(lc2, exprs)
+			{
+				Expr	*expr = (Expr *) lfirst(lc2);
+				EquivalenceMember	*em;
+
+				em = find_ec_member_matching_expr(ec, expr, NULL);
+				if (em == NULL)
+					/*
+					 * The unique expressions are not a subset of 'exprs'. In
+					 * other words, rows which differ in this expression can
+					 * have identical set of 'exprs'.
+					 */
+					return false;
+			}
+
+			keyno++;
+		}
+	}
+
+	return false;
+}
+
 /*
  * create_ordered_paths
  *
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f6a62df0b43..613fc2aa947 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -76,6 +76,14 @@ extern void create_index_paths(PlannerInfo *root, RelOptInfo *rel);
 extern bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 										  List *restrictlist,
 										  List **extra_clauses);
+extern bool relation_has_unique_index_for_exprs(PlannerInfo *root,
+												RelOptInfo *rel,
+												List *exprs,
+												List *restrictlist,
+												List **extra_clauses);
+extern bool index_is_unique_for_exprs(PlannerInfo *root, IndexOptInfo *index,
+									  List *exprs, List *restrictlist,
+									  List **extra_clauses_p);
 extern bool indexcol_is_bool_constant_for_query(PlannerInfo *root,
 												IndexOptInfo *index,
 												int indexcol);
diff --git a/src/test/regress/expected/distinct.out b/src/test/regress/expected/distinct.out
new file mode 100644
index 00000000000..d94c6153e4e
--- /dev/null
+++ b/src/test/regress/expected/distinct.out
@@ -0,0 +1,117 @@
+begin;
+create schema test_distinct;
+set search_path to test_distinct;
+-- Single relation.
+create table a (
+  i integer primary key,
+  j integer
+);
+explain (verbose on, costs off)
+select distinct i from a;
+         QUERY PLAN          
+-----------------------------
+ Seq Scan on test_distinct.a
+   Output: i
+(2 rows)
+
+-- Explicit unique-ification is needed if there's no unique index on the
+-- "j" column.
+explain (verbose on, costs off)
+select distinct j from a;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Output: j
+   Group Key: a.j
+   ->  Seq Scan on test_distinct.a
+         Output: i, j
+(5 rows)
+
+create unique index on a(j);
+-- The unique index alone is not sufficient if the column is nullable.
+explain (verbose on, costs off)
+select distinct j from a;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Output: j
+   Group Key: a.j
+   ->  Seq Scan on test_distinct.a
+         Output: i, j
+(5 rows)
+
+alter table a alter column j set not null;
+-- Now we only need sequential scan.
+explain (verbose on, costs off)
+select distinct j from a;
+         QUERY PLAN          
+-----------------------------
+ Seq Scan on test_distinct.a
+   Output: j
+(2 rows)
+
+-- Joins.
+create table b (
+  k integer,
+  l integer
+);
+explain (verbose on, costs off)
+select distinct i, l from a join b on i = k;
+                  QUERY PLAN                   
+-----------------------------------------------
+ HashAggregate
+   Output: a.i, b.l
+   Group Key: a.i, b.l
+   ->  Hash Join
+         Output: a.i, b.l
+         Inner Unique: true
+         Hash Cond: (b.k = a.i)
+         ->  Seq Scan on test_distinct.b
+               Output: b.k, b.l
+         ->  Hash
+               Output: a.i
+               ->  Seq Scan on test_distinct.a
+                     Output: a.i
+(13 rows)
+
+-- "Inner Unique" is essential to prove that the join output is distinct,
+-- but to set it the planner needs an unique index on the inner table.
+create unique index on b(k);
+-- No explicit unique-ification is needed now.
+explain (verbose on, costs off)
+select distinct i, l from a join b on i = k;
+               QUERY PLAN                
+-----------------------------------------
+ Hash Join
+   Output: a.i, b.l
+   Inner Unique: true
+   Hash Cond: (a.i = b.k)
+   ->  Seq Scan on test_distinct.a
+         Output: a.i, a.j
+   ->  Hash
+         Output: b.l, b.k
+         ->  Seq Scan on test_distinct.b
+               Output: b.l, b.k
+(10 rows)
+
+-- Non-strict join clause prevents the planner from setting "Inner Unique",
+-- so explicit unique-ification cannot be skipped.
+explain (verbose on, costs off)
+select distinct i, l from a join b on i = coalesce(k, 0);
+                  QUERY PLAN                   
+-----------------------------------------------
+ HashAggregate
+   Output: a.i, b.l
+   Group Key: a.i, b.l
+   ->  Hash Join
+         Output: a.i, b.l
+         Hash Cond: (COALESCE(b.k, 0) = a.i)
+         ->  Seq Scan on test_distinct.b
+               Output: b.k, b.l
+         ->  Hash
+               Output: a.i
+               ->  Seq Scan on test_distinct.a
+                     Output: a.i
+(12 rows)
+
+rollback;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index f56482fb9f1..54427510bac 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -63,6 +63,8 @@ test: sanity_check
 # ----------
 test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace prepared_xacts
 
+test: distinct
+
 # ----------
 # Another group of parallel tests
 # ----------
diff --git a/src/test/regress/sql/distinct.sql b/src/test/regress/sql/distinct.sql
new file mode 100644
index 00000000000..19e837ffef7
--- /dev/null
+++ b/src/test/regress/sql/distinct.sql
@@ -0,0 +1,54 @@
+begin;
+
+create schema test_distinct;
+set search_path to test_distinct;
+
+-- Single relation.
+create table a (
+  i integer primary key,
+  j integer
+);
+
+explain (verbose on, costs off)
+select distinct i from a;
+
+-- Explicit unique-ification is needed if there's no unique index on the
+-- "j" column.
+explain (verbose on, costs off)
+select distinct j from a;
+
+create unique index on a(j);
+
+-- The unique index alone is not sufficient if the column is nullable.
+explain (verbose on, costs off)
+select distinct j from a;
+
+alter table a alter column j set not null;
+
+-- Now we only need sequential scan.
+explain (verbose on, costs off)
+select distinct j from a;
+
+-- Joins.
+create table b (
+  k integer,
+  l integer
+);
+
+explain (verbose on, costs off)
+select distinct i, l from a join b on i = k;
+
+-- "Inner Unique" is essential to prove that the join output is distinct,
+-- but to set it the planner needs an unique index on the inner table.
+create unique index on b(k);
+
+-- No explicit unique-ification is needed now.
+explain (verbose on, costs off)
+select distinct i, l from a join b on i = k;
+
+-- Non-strict join clause prevents the planner from setting "Inner Unique",
+-- so explicit unique-ification cannot be skipped.
+explain (verbose on, costs off)
+select distinct i, l from a join b on i = coalesce(k, 0);
+
+rollback;
\ No newline at end of file
-- 
2.47.3

Avoid unnecessary processing of DISTINCT clause (Was: Unique Keys)

Reply via email to