On 21.07.2025 16:55, Ilia Evdokimov wrote:
While analyzing planner performance on JOB with
default_statistics_target = 1000, I noticed that a significant portion
of planning time is spent inside the eqjoinsel() function. According
to perf, in most JOB queries at default_statistics_target = 1000,
eqjoinsel() is the most expensive function during planning, accounting
for approximately 8% of total CPU time. At default_statistics_target =
10000, the planner spend up to 75% of its time inside eqjoinsel(),
making it one of the primary bottlenecks.
This overhead is caused by the O(N^2) nested-loop comparison of MCVs
in var1 = var2 clauses.
I propose an optimization: when the column datatype supports
ordering(i.e., has < and >), we can sort both MCV lists and apply
mege-style algorithm to detect matches. This reduces runtime from
O(N^2) to O(NlogN), where N is the number of MCV entries. The patch
also applies the same optimization to semi-join clauses, which show
similar performance behavior.
Following up on my previous message about optimizing eqjoinsel() for
Var1 = Var2 and semijoin clauses, I’d like to share more detailed
performance results across different values of default_statistics_target
on the JOB benchmark.
The performance improvement grows as the number of MCV entries increases
(i.e., with higher default_statistics_target). The table below shows
total planner time summed over all 113 queries in JOB for each setting
of default_statistics_target, before and after applying patch:
Total planner time across all JOB queries
=========================================
default_statistics_target | Before Patch (ms) | After Patch (ms)
--------------------------+-------------------+------------------
100 | 1828.433 | 1820.556
1000 | 2194.282 | 1963.110
2500 | 4606.705 | 2140.126
5000 | 16661.581 | 2616.109
7500 | 35988.569 | 3061.161
10000 | 66616.620 | 3504.144
For default_statistics_target < 1000, the planning time remains the same
before and after the patch. The optimization reduces planner
time substantially - by up to *13X *at default_statistics_target = 10000
- while the generated plans and selectivity calculations remain
unchanged. To illustrate this, the table below shows the 10 slowest JOB
queries (by planning time), along with their planning times before and
after applying the patch.
Top 10 slowest queries at default_statistics_target = 10000
===========================================================
Query | Before Patch (ms) | After Patch (ms)
------+--------------------+-------------------
29c | 1939.282 | 111.219
29b | 1939.237 | 100.109
29a | 1931.870 | 100.224
31c | 1622.255 | 67.609
30c | 1602.156 | 70.942
28b | 1521.026 | 84.058
30b | 1519.972 | 68.777
30a | 1518.014 | 70.529
28a | 1514.908 | 86.662
31a | 1507.303 | 63.579
As shown, the total planner time for these top 10 queries drops
substantially with the optimization.
I’ve identified and fixed two issues in the original v1 patch: In
'eqjoinsel_semi' the second MCV array was allocated with an incorrect
size. And the initialization of FunctionCallInfoData was moved outside
the comparator compare_mcv_items() to avoid repeated setup during
sorting. I've attached the updated v2 patch with changes.
Any suggestions?
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
From 56cfbc9f93b69f877d7e9b350bfbce51c1908802 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdoki...@tantorlabs.ru>
Date: Mon, 28 Jul 2025 13:34:05 +0300
Subject: [PATCH v2] Optimize selectivity estimation for Var = Var clauses
using merge-based MCV comparison
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, selectivity estimation for join and semi-join clauses of the form
Var = Var relied on a nested-loop comparison of MCV lists, resulting in O(N^2)
behavior with respect to the number of common values.
This patch introduces a merge-based strategy when the underlying type supports
BTREE ordering. In such cases, the MCVs are sorted using the datatype-specific
comparison operator, and selectivity is estimated via linear-time merge scan.
---
src/backend/utils/adt/selfuncs.c | 302 ++++++++++++++++++++++------
src/backend/utils/cache/lsyscache.c | 61 ++++++
src/include/utils/lsyscache.h | 1 +
src/include/utils/selfuncs.h | 8 +
4 files changed, 312 insertions(+), 60 deletions(-)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 17fbfa9b410..6bcec4d0e59 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -217,6 +217,7 @@ static bool get_actual_variable_endpoint(Relation heapRel,
static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
static double btcost_correlation(IndexOptInfo *index,
VariableStatData *vardata);
+static int compare_mcv_items(const void *a, const void *b, void *arg);
/*
@@ -2463,7 +2464,6 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
* of Wisconsin, Madison, March 1991 (available from ftp.cs.wisc.edu).
*/
LOCAL_FCINFO(fcinfo, 2);
- FmgrInfo eqproc;
bool *hasmatch1;
bool *hasmatch2;
double nullfrac1 = stats1->stanullfrac;
@@ -2479,52 +2479,134 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
totalsel2;
int i,
nmatches;
-
- fmgr_info(opfuncoid, &eqproc);
-
- /*
- * Save a few cycles by setting up the fcinfo struct just once. Using
- * FunctionCallInvoke directly also avoids failure if the eqproc
- * returns NULL, though really equality functions should never do
- * that.
- */
- InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation,
- NULL, NULL);
- fcinfo->args[0].isnull = false;
- fcinfo->args[1].isnull = false;
+ Oid compare_func_oid = InvalidOid;
hasmatch1 = (bool *) palloc0(sslot1->nvalues * sizeof(bool));
hasmatch2 = (bool *) palloc0(sslot2->nvalues * sizeof(bool));
/*
* Note we assume that each MCV will match at most one member of the
- * other MCV list. If the operator isn't really equality, there could
- * be multiple matches --- but we don't look for them, both for speed
- * and because the math wouldn't add up...
+ * other MCV list.
*/
matchprodfreq = 0.0;
nmatches = 0;
- for (i = 0; i < sslot1->nvalues; i++)
+ if (get_comparison_op_for_sort(opfuncoid, &compare_func_oid))
{
- int j;
+ LOCAL_FCINFO(merge_fcinfo, 2);
+ FmgrInfo cmpproc;
+ McvItem *mcvs1,
+ *mcvs2;
+ int ptr1,
+ ptr2;
+
+ fmgr_info(compare_func_oid, &cmpproc);
+
+ InitFunctionCallInfoData(*merge_fcinfo, &cmpproc, 2,
+ collation, NULL, NULL);
- fcinfo->args[0].value = sslot1->values[i];
+ mcvs1 = (McvItem *) palloc(sslot1->nvalues * sizeof(McvItem));
+ mcvs2 = (McvItem *) palloc(sslot2->nvalues * sizeof(McvItem));
- for (j = 0; j < sslot2->nvalues; j++)
+ /*
+ * Copy the MCV values and their frequencies into a temporary array
+ * so we can safely sort them without altering the original statistics.
+ */
+ for (i = 0; i < sslot1->nvalues; i++)
{
- Datum fresult;
+ mcvs1[i].value = sslot1->values[i];
+ mcvs1[i].original_idx = i;
+ mcvs1[i].frequency = sslot1->numbers[i];
+ }
- if (hasmatch2[j])
- continue;
- fcinfo->args[1].value = sslot2->values[j];
- fcinfo->isnull = false;
- fresult = FunctionCallInvoke(fcinfo);
- if (!fcinfo->isnull && DatumGetBool(fresult))
- {
- hasmatch1[i] = hasmatch2[j] = true;
- matchprodfreq += sslot1->numbers[i] * sslot2->numbers[j];
+ for (i = 0; i < sslot2->nvalues; i++)
+ {
+ mcvs2[i].value = sslot2->values[i];
+ mcvs2[i].original_idx = i;
+ mcvs2[i].frequency = sslot2->numbers[i];
+ }
+
+ qsort_arg(mcvs1, sslot1->nvalues, sizeof(McvItem), compare_mcv_items, &merge_fcinfo);
+ qsort_arg(mcvs2, sslot2->nvalues, sizeof(McvItem), compare_mcv_items, &merge_fcinfo);
+
+ ptr1 = ptr2 = 0;
+ while (ptr1 < sslot1->nvalues && ptr2 < sslot2->nvalues)
+ {
+ McvItem *item1 = &mcvs1[ptr1];
+ McvItem *item2 = &mcvs2[ptr2];
+ Datum result_datum;
+ int cmp_result;
+
+ merge_fcinfo->args[0].value = item1->value;
+ merge_fcinfo->args[0].isnull = false;
+ merge_fcinfo->args[1].value = item2->value;
+ merge_fcinfo->args[1].isnull = false;
+
+ result_datum = FunctionCallInvoke(merge_fcinfo);
+ cmp_result = DatumGetInt32(result_datum);
+
+ if (cmp_result == 0)
+ {
+ hasmatch1[item1->original_idx] = true;
+ hasmatch2[item2->original_idx] = true;
+
+ matchprodfreq += item1->frequency * item2->frequency;
nmatches++;
- break;
+
+ ptr1++;
+ ptr2++;
+ }
+ else if (cmp_result < 0)
+ ptr1++;
+ else
+ ptr2++;
+ }
+
+ pfree(mcvs1);
+ pfree(mcvs2);
+ }
+ else
+ {
+ FmgrInfo eqproc;
+
+ fmgr_info(opfuncoid, &eqproc);
+ /*
+ * Save a few cycles by setting up the fcinfo struct just once. Using
+ * FunctionCallInvoke directly also avoids failure if the eqproc
+ * returns NULL, though really equality functions should never do
+ * that.
+ */
+ InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation,
+ NULL, NULL);
+ fcinfo->args[0].isnull = false;
+ fcinfo->args[1].isnull = false;
+
+ /*
+ * If the operator isn't really equality, there could
+ * be multiple matches --- but we don't look for them, both for speed
+ * and because the math wouldn't add up...
+ */
+ for (i = 0; i < sslot1->nvalues; i++)
+ {
+ int j;
+
+ fcinfo->args[0].value = sslot1->values[i];
+
+ for (j = 0; j < sslot2->nvalues; j++)
+ {
+ Datum fresult;
+
+ if (hasmatch2[j])
+ continue;
+ fcinfo->args[1].value = sslot2->values[j];
+ fcinfo->isnull = false;
+ fresult = FunctionCallInvoke(fcinfo);
+ if (!fcinfo->isnull && DatumGetBool(fresult))
+ {
+ hasmatch1[i] = hasmatch2[j] = true;
+ matchprodfreq += sslot1->numbers[i] * sslot2->numbers[j];
+ nmatches++;
+ break;
+ }
}
}
}
@@ -2690,7 +2772,6 @@ eqjoinsel_semi(Oid opfuncoid, Oid collation,
* in a skewed distribution this gives us a big leg up in accuracy.
*/
LOCAL_FCINFO(fcinfo, 2);
- FmgrInfo eqproc;
bool *hasmatch1;
bool *hasmatch2;
double nullfrac1 = stats1->stanullfrac;
@@ -2700,6 +2781,7 @@ eqjoinsel_semi(Oid opfuncoid, Oid collation,
int i,
nmatches,
clamped_nvalues2;
+ Oid compare_func_oid = InvalidOid;
/*
* The clamping above could have resulted in nd2 being less than
@@ -2710,19 +2792,6 @@ eqjoinsel_semi(Oid opfuncoid, Oid collation,
*/
clamped_nvalues2 = Min(sslot2->nvalues, nd2);
- fmgr_info(opfuncoid, &eqproc);
-
- /*
- * Save a few cycles by setting up the fcinfo struct just once. Using
- * FunctionCallInvoke directly also avoids failure if the eqproc
- * returns NULL, though really equality functions should never do
- * that.
- */
- InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation,
- NULL, NULL);
- fcinfo->args[0].isnull = false;
- fcinfo->args[1].isnull = false;
-
hasmatch1 = (bool *) palloc0(sslot1->nvalues * sizeof(bool));
hasmatch2 = (bool *) palloc0(clamped_nvalues2 * sizeof(bool));
@@ -2733,26 +2802,116 @@ eqjoinsel_semi(Oid opfuncoid, Oid collation,
* and because the math wouldn't add up...
*/
nmatches = 0;
- for (i = 0; i < sslot1->nvalues; i++)
+ if (get_comparison_op_for_sort(opfuncoid, &compare_func_oid))
{
- int j;
+ LOCAL_FCINFO(merge_fcinfo, 2);
+ FmgrInfo cmpproc;
+ McvItem *mcvs1,
+ *mcvs2;
+ int ptr1,
+ ptr2;
- fcinfo->args[0].value = sslot1->values[i];
+ fmgr_info(compare_func_oid, &cmpproc);
+ InitFunctionCallInfoData(*merge_fcinfo, &cmpproc, 2,
+ collation, NULL, NULL);
- for (j = 0; j < clamped_nvalues2; j++)
+ mcvs1 = (McvItem *) palloc(sslot1->nvalues * sizeof(McvItem));
+ mcvs2 = (McvItem *) palloc(clamped_nvalues2 * sizeof(McvItem));
+
+ /*
+ * Copy the MCV values and their frequencies into a temporary array
+ * so we can safely sort them without altering the original statistics.
+ */
+ for (i = 0; i < sslot1->nvalues; i++)
{
- Datum fresult;
+ mcvs1[i].value = sslot1->values[i];
+ mcvs1[i].original_idx = i;
+ mcvs1[i].frequency = sslot1->numbers[i];
+ }
- if (hasmatch2[j])
- continue;
- fcinfo->args[1].value = sslot2->values[j];
- fcinfo->isnull = false;
- fresult = FunctionCallInvoke(fcinfo);
- if (!fcinfo->isnull && DatumGetBool(fresult))
- {
- hasmatch1[i] = hasmatch2[j] = true;
+ for (i = 0; i < clamped_nvalues2; i++)
+ {
+ mcvs2[i].value = sslot2->values[i];
+ mcvs2[i].original_idx = i;
+ mcvs2[i].frequency = sslot2->numbers[i];
+ }
+
+ qsort_arg(mcvs1, sslot1->nvalues, sizeof(McvItem), compare_mcv_items, &merge_fcinfo);
+ qsort_arg(mcvs2, clamped_nvalues2, sizeof(McvItem), compare_mcv_items, &merge_fcinfo);
+
+ ptr1 = ptr2 = 0;
+ while (ptr1 < sslot1->nvalues && ptr2 < clamped_nvalues2)
+ {
+ McvItem *item1 = &mcvs1[ptr1];
+ McvItem *item2 = &mcvs2[ptr2];
+ Datum result_datum;
+ int cmp_result;
+
+ merge_fcinfo->args[0].value = item1->value;
+ merge_fcinfo->args[0].isnull = false;
+ merge_fcinfo->args[1].value = item2->value;
+ merge_fcinfo->args[1].isnull = false;
+
+ result_datum = FunctionCallInvoke(merge_fcinfo);
+ cmp_result = DatumGetInt32(result_datum);
+
+ if (cmp_result == 0)
+ {
+ hasmatch1[item1->original_idx] = true;
+ hasmatch2[item2->original_idx] = true;
+
nmatches++;
- break;
+
+ ptr1++;
+ ptr2++;
+ }
+ else if (cmp_result < 0)
+ ptr1++;
+ else
+ ptr2++;
+ }
+
+ pfree(mcvs1);
+ pfree(mcvs2);
+ }
+ else
+ {
+ FmgrInfo eqproc;
+
+ fmgr_info(opfuncoid, &eqproc);
+ InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation,
+ NULL, NULL);
+
+ fcinfo->args[0].isnull = false;
+ fcinfo->args[1].isnull = false;
+
+ /*
+ * Save a few cycles by setting up the fcinfo struct just once. Using
+ * FunctionCallInvoke directly also avoids failure if the eqproc
+ * returns NULL, though really equality functions should never do
+ * that.
+ */
+ for (i = 0; i < sslot1->nvalues; i++)
+ {
+ int j;
+
+ fcinfo->args[0].value = sslot1->values[i];
+
+ for (j = 0; j < clamped_nvalues2; j++)
+ {
+ Datum fresult;
+
+ if (hasmatch2[j])
+ continue;
+ fcinfo->args[1].value = sslot2->values[j];
+ fcinfo->isnull = false;
+ fresult = FunctionCallInvoke(fcinfo);
+ if (!fcinfo->isnull && DatumGetBool(fresult))
+ {
+ hasmatch1[i] = hasmatch2[j] = true;
+ nmatches++;
+ break;
+ }
}
}
}
@@ -8753,3 +8912,26 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*indexPages = index->pages;
}
+
+/*
+ * Comparator for sorting MCV items
+ */
+static int
+compare_mcv_items(const void *a, const void *b, void *arg)
+{
+ FunctionCallInfo fcinfo_comp = *((FunctionCallInfo *) arg);
+ Datum result_datum;
+
+ const McvItem *item1 = (const McvItem *) a;
+ const McvItem *item2 = (const McvItem *) b;
+
+ fcinfo_comp->args[0].value = item1->value;
+ fcinfo_comp->args[0].isnull = false;
+
+ fcinfo_comp->args[1].value = item2->value;
+ fcinfo_comp->args[1].isnull = false;
+
+ result_datum = FunctionCallInvoke(fcinfo_comp);
+
+ return DatumGetInt32(result_datum);
+}
\ No newline at end of file
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c460a72b75d..1a73a964bc4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/htup_details.h"
+#include "access/nbtree.h"
#include "bootstrap/bootstrap.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
@@ -39,6 +40,7 @@
#include "catalog/pg_subscription.h"
#include "catalog/pg_transform.h"
#include "catalog/pg_type.h"
+#include "commands/defrem.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "utils/array.h"
@@ -315,6 +317,65 @@ get_ordering_op_properties(Oid opno,
return result;
}
+/*
+ * get_comparison_op_for_sort
+ * Get the OID of a datatype-specific comparison operator
+ * associated with a given equality operator.
+ *
+ * This is typically used in clauses like Var = Var, where both arguments
+ * are of the same type. We look up the corresponding BTORDER_PROC for
+ * the default B-tree opclass of the datatype.
+ *
+ * However, for special cases like array_eq and record_eq, a comparison
+ * operator may exist (btarraycmp, btrecordcmp), but sorting is only
+ * possible if all element or field types are themselves sortable.
+ * Rather than reimplement that logic here, we simply exclude such
+ * cases by checking for btarraycmp and btrecordcmp explicitly.
+ *
+ * Returns false if no suitable comparison operator is found.
+ */
+bool
+get_comparison_op_for_sort(Oid opno, Oid *cmpopno)
+{
+ bool result = false;
+ HeapTuple proctup;
+
+ proctup = SearchSysCache1(PROCOID, ObjectIdGetDatum(opno));
+ if (HeapTupleIsValid(proctup))
+ {
+ Form_pg_proc procform = (Form_pg_proc) GETSTRUCT(proctup);
+
+ /* Only consider binary operators where both arguments are of the same type. */
+ if (procform->pronargs == 2 &&
+ OidIsValid(procform->proargtypes.values[0]) &&
+ OidIsValid(procform->proargtypes.values[1]) &&
+ procform->proargtypes.values[0] == procform->proargtypes.values[1])
+ {
+ Oid type_oid = procform->proargtypes.values[0];
+ Oid opclass_oid = GetDefaultOpClass(type_oid, BTREE_AM_OID);
+
+ if (OidIsValid(opclass_oid))
+ {
+ Oid opfamily_oid = get_opclass_family(opclass_oid);
+
+ *cmpopno = get_opfamily_proc(opfamily_oid, type_oid, type_oid, BTORDER_PROC);
+
+ /*
+ * Skip array_eq and record_eq: their comparison functions exist,
+ * but sorting is only possible if all elements or fields are sortable.
+ */
+ if (OidIsValid(*cmpopno) &&
+ *cmpopno != F_BTARRAYCMP &&
+ *cmpopno != F_BTRECORDCMP)
+ result = true;
+ }
+ }
+ ReleaseSysCache(proctup);
+ }
+
+ return result;
+}
+
/*
* get_equality_op_for_ordering_op
* Get the OID of the datatype-specific equality operator
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index fa7c7e0323b..3ddcca5ac08 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -80,6 +80,7 @@ extern Oid get_opfamily_member_for_cmptype(Oid opfamily, Oid lefttype, Oid right
extern bool get_ordering_op_properties(Oid opno,
Oid *opfamily, Oid *opcintype, CompareType *cmptype);
extern Oid get_equality_op_for_ordering_op(Oid opno, bool *reverse);
+extern bool get_comparison_op_for_sort(Oid opno, Oid *cmpopno);
extern Oid get_ordering_op_for_equality_op(Oid opno, bool use_lhs_type);
extern List *get_mergejoin_opfamilies(Oid opno);
extern bool get_compatible_hash_operators(Oid opno,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 013049b3098..66505d3b79c 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -137,6 +137,14 @@ typedef struct
double num_sa_scans; /* # indexscans from ScalarArrayOpExprs */
} GenericCosts;
+/* Represents an MCV statistic entry */
+typedef struct McvItem
+{
+ Datum value;
+ int original_idx; /* The original index of the item in the array before sorting. */
+ double frequency; /* Frequency of value in statistics */
+} McvItem;
+
/* Hooks for plugins to get control when we ask for stats */
typedef bool (*get_relation_stats_hook_type) (PlannerInfo *root,
RangeTblEntry *rte,
--
2.34.1