Hello,

Thank you for review.
My answers are inside.


On 21.01.2021 15:30, Yugo NAGATA wrote:
Hello,

On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

Attached please find new version of the patch with more comments and
descriptions added.
Adaptive query optimization is very interesting feature for me, so I looked
into this patch.  Here are some comments and questions.

(1)
This patch needs rebase because clauselist_selectivity was modified to improve
estimation of OR clauses.

Rebased version is attached.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic 
automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one 
clauses,
and also functional dependencies between a var in the join qual and vars in 
quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will 
work
even if we don't assume the other.  I wonder it would be better to split the 
patch
again, and register them to commitfest separately.

I agree with you that this are two almost unrelated changes, although without clausesel patch additional statistic can not improve query planning.
But I already have too many patches at commitfest.
May be it will be enough to spit this patch into two?


(3)
+       DefineCustomBoolVariable("auto_explain.suggest_only",
+                                                        "Do not create statistic 
but just record in WAL suggested create statistics statement.",
+                                                        NULL,
+                                                        
&auto_explain_suggest_on

To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.

Done.


(4)
+                       /*
+                        * Prevent concurrent access to extended statistic table
+                        */
+                       stat_rel = table_open(StatisticExtRelationId, 
AccessExclusiveLock);
+                       slot = table_slot_create(stat_rel, NULL);
+                       scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+                       table_close(stat_rel, AccessExclusiveLock);
+               }

When I tested the auto_explain part, I got the following WARNING.

  WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, 
flags=0x83000000, refcount=1 2)
  WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, 
flags=0x83000000, refcount=1 1)
  WARNING:  relcache reference leak: relation "pg_statistic_ext" not closed
  WARNING:  TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still 
referenced
  WARNING:  Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress this, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function.

Thank you for noticing the problem, fixed.



(6)
+                                       elog(NOTICE, "Auto_explain suggestion: 
CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);

We should use ereport instead of elog for log messages.

Changed.


(7)
+                                               double dep = 
find_var_dependency(root, innerRelid, var, clauses_attnums);
+                                               if (dep != 0.0)
+                                               {
+                                                       s1 *= dep + (1 - dep) * 
s2;
+                                                       continue;
+                                               }

I found the following comment of clauselist_apply_dependencies():

   * we actually combine selectivities using the formula
   *
   *      P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

so, is it not necessary using the same formula in this patch? That is,

    s1 *= dep + (1-dep) * s2                   (if s1 <= s2)
   s1 *= dep * (s2/s1) + (1-dep) * s2   (otherwise) .
Makes sense.

(8)
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables 
from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clauses_vars,
so it needs unification.

Fixed.


(9)
Currently, it only consider functional dependencies statistics. Can we also
consider multivariate MCV list, and is it useful?


Right now auto_explain create statistic without explicit specification of statistic kind. According to the documentation all supported statistics kinds should be created in this case:

/|statistics_kind|/

   A statistics kind to be computed in this statistics object.
   Currently supported kinds are |ndistinct|, which enables n-distinct
   statistics, and |dependencies|, which enables functional dependency
   statistics. If this clause is omitted, all supported statistics
   kinds are included in the statistics object. For more information,
   see Section 14.2.2
   
<https://www.postgresql.org/docs/10/planner-stats.html#PLANNER-STATS-EXTENDED>
   and Section 68.2
   <https://www.postgresql.org/docs/10/multivariate-statistics-examples.html>.




(10)
To achieve adaptive query optimization  (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could 
auto_explain
be a infrastructure of AQO in future? Or, do you have any plan or idea to make
built-in infrastructure for AQO?
Sorry, I do not have answer for this question.
I just patched auto_explain extension because it is doing  half of the required work (analyze  expensive statements). It can be certainly moved to separate extension. In this case it will party duplicate existed functionality and settings of auto_explain (like statement execution time threshold). I am not sure that it is good. But from the other side, this my patch makes auto_explain extension to do some unexpected work...

Actually task of adaptive query optimization is much bigger.
We have separate AQO extension which tries to use machine learning to correctly adjust estimations. This my patch is much simpler and use existed mechanism (extended statistics) to improve estimations.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index faa6231..5d8f088 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <math.h>
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -34,7 +54,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_add_statistics_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -85,6 +107,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -230,6 +253,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.add_statistics_suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_add_statistics_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -363,6 +410,282 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	/* Extract vars from all quals */
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index relno = 0;
+		Bitmapset* colmap = NULL;
+
+		/* Contruct list of unique vars */
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+				Var *var = (Var *) node;
+				int varno = IS_SPECIAL_VARNO(var->varno) ? var->varnosyn : var->varno;
+				if (cols == NULL || varno == relno)
+				{
+					relno = varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 && /* not synthetic var */
+						varno <= list_length(es->rtable) &&
+						list_length(cols) < STATS_MAX_DIMENSIONS)
+					{
+						RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+						if (rte->rtekind == RTE_RELATION)
+						{
+							ColumnRef  *col = makeNode(ColumnRef);
+							char *colname = get_rte_attribute_name(rte, var->varattno);
+							col->fields = list_make1(makeString(colname));
+							cols = lappend(cols, col);
+							colmap = bms_add_member(colmap, var->varattno);
+						}
+					}
+				}
+				else
+				{
+					continue;
+				}
+			}
+			vars = foreach_delete_current(vars, cell);
+		}
+		/* To create multicolumn statitics we need to have at least 2 columns */
+		if (list_length(cols) >= 2)
+		{
+			RangeTblEntry *rte = rt_fetch(relno, es->rtable);
+			CreateStatsStmt* stats = makeNode(CreateStatsStmt);
+			char *rel_namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			char *rel_name = get_rel_name(rte->relid);
+			RangeVar* rel = makeRangeVar(rel_namespace, rel_name, 0);
+			char* stat_name = rel_name;
+			char* create_stat_stmt = (char*)"";
+			char const* sep = "ON";
+			ScanKeyData entry[2];
+			TableScanDesc scan;
+			Relation stat_rel;
+			size_t name_len;
+			TupleTableSlot *slot;
+
+			/* Sort variables by name */
+			list_sort(cols, vars_list_comparator);
+
+			/* Construct name for statistic by concatenating relation name with all columns */
+			foreach (cell, cols)
+			{
+				char* col_name = strVal((Value *) linitial(((ColumnRef *)lfirst(cell))->fields));
+				stat_name = psprintf("%s_%s", stat_name, col_name);
+				create_stat_stmt = psprintf("%s%s %s", create_stat_stmt, sep, col_name);
+				sep = ",";
+			}
+
+			name_len = strlen(stat_name);
+			/* Truncate name if it doesn't fit in NameData */
+			if (name_len >= NAMEDATALEN)
+				stat_name = psprintf("%.*s_%08x", NAMEDATALEN - 10, stat_name, (unsigned)hash_any((uint8*)stat_name, name_len));
+
+			ScanKeyInit(&entry[0],
+						Anum_pg_statistic_ext_stxname,
+						BTEqualStrategyNumber, F_NAMEEQ,
+						CStringGetDatum(stat_name));
+			ScanKeyInit(&entry[1],
+						Anum_pg_statistic_ext_stxnamespace,
+						BTEqualStrategyNumber, F_OIDEQ,
+						ObjectIdGetDatum(get_rel_namespace(rte->relid)));
+
+			/*
+			 * Prevent concurrent access to extended statistic table
+			 */
+			stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+			slot = table_slot_create(stat_rel, NULL);
+			scan = table_beginscan_catalog(stat_rel, 2, entry);
+
+			/*
+			 * Check if multicolumn statistic object with such name already exists.
+			 * Most likely if was already created by auto_explain, but either ANALYZE was not performed since
+			 * this time, either presence of this multicolumn statistic doesn't help to provide more precise estimation.
+			 * Despite to the fact that we create statistics with "if_not_exist" option, presence of such check
+			 * allows to eliminate notice message that statistics object already exists.
+			 */
+			if (!table_scan_getnextslot(scan, ForwardScanDirection, slot))
+			{
+				if (auto_explain_add_statistics_suggest_only)
+				{
+					ereport(NOTICE, (errmsg("Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s",
+											stat_name, create_stat_stmt, rel_name),
+									 errhidestmt(true)));
+				}
+				else
+				{
+					ereport(LOG, (errmsg("Add statistics %s", stat_name),
+								  errhidestmt(true)));
+					stats->defnames = list_make2(makeString(rel_namespace), makeString(stat_name));
+					stats->if_not_exists = true;
+					stats->relations = list_make1(rel);
+					stats->exprs = cols;
+					CreateStatistics(stats);
+				}
+			}
+			table_endscan(scan);
+			ExecDropSingleTupleTableSlot(slot);
+			table_close(stat_rel, AccessExclusiveLock);
+		}
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for node
+ */
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->plan;
+
+	if (planstate->instrument && plan->plan_rows != 0)
+	{
+		if (auto_explain_add_statistics_threshold != 0
+			&& planstate->instrument->ntuples / plan->plan_rows >= auto_explain_add_statistics_threshold)
+		{
+			elog(DEBUG1, "Estimated=%f, actual=%f, error=%f: plan=%s", plan->plan_rows, planstate->instrument->ntuples, planstate->instrument->ntuples / plan->plan_rows, nodeToString(plan));
+			/* quals, sort keys, etc */
+			switch (nodeTag(plan))
+			{
+			  case T_IndexScan:
+				AddMultiColumnStatisticsForQual(((IndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_IndexOnlyScan:
+				AddMultiColumnStatisticsForQual(((IndexOnlyScan *) plan)->indexqual, es);
+				break;
+			  case T_BitmapIndexScan:
+				AddMultiColumnStatisticsForQual(((BitmapIndexScan *) plan)->indexqualorig, es);
+				break;
+			  case T_NestLoop:
+				AddMultiColumnStatisticsForQual(((NestLoop *) plan)->join.joinqual, es);
+				break;
+			  case T_MergeJoin:
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->mergeclauses, es);
+				AddMultiColumnStatisticsForQual(((MergeJoin *) plan)->join.joinqual, es);
+				break;
+			  case T_HashJoin:
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->hashclauses, es);
+				AddMultiColumnStatisticsForQual(((HashJoin *) plan)->join.joinqual, es);
+				break;
+			  default:
+				break;
+			}
+			AddMultiColumnStatisticsForQual(plan->qual, es);
+		}
+	}
+
+	/* initPlan-s */
+	if (planstate->initPlan)
+		AddMultiColumnStatisticsForSubPlans(planstate->initPlan, es);
+
+	/* lefttree */
+	if (outerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(outerPlanState(planstate), es);
+
+	/* righttree */
+	if (innerPlanState(planstate))
+		AddMultiColumnStatisticsForNode(innerPlanState(planstate), es);
+
+	/* special child plans */
+	switch (nodeTag(plan))
+	{
+		case T_ModifyTable:
+			AddMultiColumnStatisticsForMemberNodes(((ModifyTableState *) planstate)->mt_plans,
+												   ((ModifyTableState *) planstate)->mt_nplans,
+												   es);
+			break;
+		case T_Append:
+			AddMultiColumnStatisticsForMemberNodes(((AppendState *) planstate)->appendplans,
+												   ((AppendState *) planstate)->as_nplans,
+												   es);
+			break;
+		case T_MergeAppend:
+			AddMultiColumnStatisticsForMemberNodes(((MergeAppendState *) planstate)->mergeplans,
+												   ((MergeAppendState *) planstate)->ms_nplans,
+												   es);
+			break;
+		case T_BitmapAnd:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapAndState *) planstate)->bitmapplans,
+												   ((BitmapAndState *) planstate)->nplans,
+												   es);
+			break;
+		case T_BitmapOr:
+			AddMultiColumnStatisticsForMemberNodes(((BitmapOrState *) planstate)->bitmapplans,
+												   ((BitmapOrState *) planstate)->nplans,
+												   es);
+			break;
+		case T_SubqueryScan:
+			AddMultiColumnStatisticsForNode(((SubqueryScanState *) planstate)->subplan, es);
+			break;
+		default:
+			break;
+	}
+}
+
 /*
  * ExecutorEnd hook: log results if needed
  */
@@ -403,6 +726,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintJITSummary(es, queryDesc);
 			ExplainEndOutput(es);
 
+			/* Add multicolumn statistic if requested */
+			if (auto_explain_add_statistics_threshold && !IsParallelWorker())
+				AddMultiColumnStatisticsForNode(queryDesc->planstate, es);
+
 			/* Remove last line break */
 			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
 				es->str->data[--es->str->len] = '\0';
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index 30e35a7..559f7be 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -280,6 +280,41 @@ LOAD 'auto_explain';
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+       actual/estimated #rows ratio triggering creation of multicolumn statistic
+       for the related columns. It can be used for adpative query optimization.
+       If there is large gap between real and estimated number of tuples for the
+       concrete plan node, then multicolumn statistic is created for involved
+       attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>auto_explain.auto_explain.add_statistics_suggest_only</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>auto_explain.add_statistics_suggest_only</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+       <varname>auto_explain.add_statistics_suggest_only</varname> disables creation
+       of multicolumn statistic even if <varname>auto_explain.add_statistics_threshold</varname>
+       contains non zero value.Only log message will be reported in case of bad estimation.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf..5446ad5 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 /*
  * Data structure for accumulating info about possible range-query
@@ -56,6 +58,47 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
  ****************************************************************************/
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var:     variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+														&attnums, 1);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+				MVDependency *dependency = dependencies->deps[i];
+				int n_dep_vars = dependency->nattributes - 1;
+				/* Dependency implies attribute */
+				if (var->varattno == dependency->attributes[n_dep_vars])
+				{
+					while (--n_dep_vars >= 0
+						   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+					if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+						strongest = dependency;
+				}
+			}
+			if (strongest)
+				return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -129,6 +172,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int         innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause, just go directly to
@@ -139,6 +185,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Determine if these clauses reference a single relation.  If so, and if
 	 * it has extended statistics, try to apply those.
@@ -171,7 +220,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -204,6 +252,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -215,6 +264,42 @@ clauselist_selectivity_ext(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int         oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+				Var* var = (Var*)linitial(expr->args);
+				if (!IsA(var, Var) || var->varno != innerRelid)
+				{
+					var = (Var*)lsecond(expr->args);
+				}
+				if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+				{
+					clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+					if (n_clauses_attnums++ != 0)
+					{
+						double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+						if (dep != 0.0)
+						{
+							/*
+							 * Replace s2 with the conditional probability s2 given s1, computed
+							 * using the formula P(b|a) = P(a,b) / P(a), which simplifies to
+							 *
+							 * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
+							 *
+							 * where P(a) = s1, the selectivity of the implying attributes, and
+							 * P(b) = s2, the selectivity of the implied attribute.
+							 */
+							if (s1 <= s2)
+								s1 *= dep + (1 - dep) * s2;
+							else
+								s1 *= dep * s2 / s1 + (1 - dep) * s2;
+							continue;
+						}
+					}
+				}
+			}
 
 			if (rinfo)
 			{
@@ -240,7 +325,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 				 * selectivity in generically.  But if it's the right oprrest,
 				 * add the clause to rqlist for later processing.
 				 */
-				switch (get_oprrest(expr->opno))
+				switch (oprrest)
 				{
 					case F_SCALARLTSEL:
 					case F_SCALARLESEL:
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index aab06c7..74ceaf5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4790,6 +4790,30 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 	return nrows;
 }
 
+
+/*
+ * Try to find dependency between variables.
+ * var: varaibles which dependencies are considered
+ * clause_vars: list of variables used in other clauses
+ * This functions return strongest dependency and some subset of variables from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{
+	ListCell* lc;
+	Bitmapset *attnums = NULL;
+	Index relid = var->varno;
+
+	foreach (lc, clause_vars)
+	{
+		Var* join_var = (Var*)lfirst(lc);
+		if (join_var->varno == relid && join_var->varattno >= 0)
+			attnums = bms_add_member(attnums, join_var->varattno);
+	}
+	return attnums ? find_var_dependency(root, relid, var, bms_add_member(attnums, var->varattno)) : 0.0;
+}
+
 /*
  * calc_joinrel_size_estimate
  *		Workhorse for set_joinrel_size_estimates and
@@ -4886,6 +4910,39 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
 
+	/* Try to take in account functional dependencies between attributes of clauses pushed-down to joined relations and
+	 * retstrictlist clause. Right now we consider only case of restrictlist consists of one clause.
+	 */
+	if (list_length(restrictlist) == 1)
+	{
+		RestrictInfo* rinfo = linitial(restrictlist);
+		Expr* clause = rinfo->clause;
+
+		Assert(IsA(rinfo, RestrictInfo));
+
+		if (is_opclause(clause))
+		{
+			OpExpr *expr = (OpExpr *) clause;
+			ListCell* lc;
+			List* join_vars = NULL;
+
+			/* Get list of all attributes in pushed-down clauses */
+			foreach (lc, outer_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+			foreach (lc, inner_rel->baserestrictinfo)
+				join_vars = list_concat(join_vars, pull_vars_of_level((Node*)((RestrictInfo*)lfirst(lc))->clause, 0));
+
+			foreach (lc, expr->args)
+			{
+				Var *var = (Var*) lfirst(lc);
+				if (IsA(var, Var) && var->varattno >= 0)
+				{
+					double dep = var_depends_on(root, var, join_vars);
+					jselec = jselec*(1.0 - dep) + dep;
+				}
+			}
+		}
+	}
 	/*
 	 * Basically, we multiply size of Cartesian product by selectivity.
 	 *
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 0673887..f7d91e7 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -53,4 +53,6 @@ extern void CommuteOpExpr(OpExpr *clause);
 extern Query *inline_set_returning_function(PlannerInfo *root,
 											RangeTblEntry *rte);
 
+extern double find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums);
+
 #endif							/* CLAUSES_H */

Reply via email to