Re: [HACKERS] nested loop semijoin estimates

Tomas Vondra Sat, 30 May 2015 10:18:06 -0700

On 05/30/15 03:52, Tomas Vondra wrote:



On 05/30/15 01:20, Tomas Vondra wrote:


Notice the cost - it's way lover than the previous plan (9.2 vs
~111k), yet this plan was not chosen. So either the change broke
something (e.g. by violating some optimizer assumption), or maybe
there's a bug somewhere else ...


After a bit more investigation, what I think is happening here is
add_path() does not realize this is a SEMI join and a single tuple is
enough, and discards the simple indexscan path in favor of the bitmap
index scan as that seems cheaper when scanning everything.

So not really a bug, but maybe 'consider_startup' would help?

A bit more info - the reason why the IndexPath gets discarded (in favorof BitmapHeapScan) seems to be twofold:


(a) Semijoin does no imply 'consider_startup' for the inner path, so it
    gets discarded by the BitmapHeapScan, as it has lower total cost.

    I'm not sure where exactly the consider_startup should be set, so
    I've forced that at the beginning on add_path() to see what happens.
    Sadly, this alone is not sufficient, because of the next point.

(b) To actually use startup costs in compare_path_costs_fuzzily, there
    is an additional condition about handling parameterized paths, as
    explained in this comment:

    * This function also includes special hacks to support a policy
    * enforced by its sole caller, add_path(): paths that have any
    * parameterization cannot win comparisons on the grounds of having
    * cheaper startup cost, since we deem only total cost to be of
    * interest for a parameterized path. (Unparameterized paths are
    * more common, so we check for this case last.)

    so the startup_cost comparisons look like this:

    if (consider_startup &&
        path2->startup_cost > path1->startup_cost * fuzz_factor &&
        path1->param_info == NULL)

    Of course, path1 (IndexPath) actually has parameters, which makes
    the whole condition false, and the IndexPath gets discarded :-(

    ISTM the reported test case seems like a counter example to the
    policy, so maybe this should really be relaxed somehow.

To see what happens, I tweaked both (a) and (b) by forcing

    consider_startup = true

at the beginning of add_path(), and removing the (param_info==NULL)conditions (patch attached). It also includes the run_cost tweakmentioned in the first message.

I'm not claiming this is the correct fix, the only goal was to see whathappens. And it really does produce the 'right' plan:


                             QUERY PLAN
-----------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.99..9.20 rows=1 width=74)
   ->  Index Scan using term_facttablename_columnname_idx on term t
         (cost=0.55..8.57 rows=1 width=74)
         Index Cond: (((facttablename)::text =
              'facttable_stat_fta4'::text) AND ((columnname)::text =
              'berechnungsart'::text))
   ->  Index Only Scan using facttable_stat_fta4_berechnungsart_idx on
       facttable_stat_fta4 f (cost=0.43..310525.02 rows=4870496 width=2)
         Index Cond: (berechnungsart = (t.term)::text)
(5 rows)

If I get the comments in pathnode.c about the startup_cost policy, it's"only" an optimization to reduce the number of parameterized paths, so Ibelieve this does not break anything. Also, it's no fun if the planningis a tad faster, but the query runtime is 1000x higher.


Ideas how to relax the policy without significant impact on optimizer?

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ac865be..368bed4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1836,11 +1836,11 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 		 * estimate whether that will happen, so be conservative and always
 		 * charge the whole first-scan cost once.
 		 */
-		run_cost += inner_run_cost;
-
 		outer_matched_rows = rint(outer_path_rows * semifactors->outer_match_frac);
 		inner_scan_frac = 2.0 / (semifactors->match_count + 1.0);
 
+		run_cost += inner_run_cost * inner_scan_frac;
+
 		/* Add inner run cost for additional outer tuples having matches */
 		if (outer_matched_rows > 1)
 			run_cost += (outer_matched_rows - 1) * inner_rescan_run_cost * inner_scan_frac;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7f7aa24..aa97d4f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -159,8 +159,7 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor,
 	{
 		/* path1 fuzzily worse on total cost */
 		if (consider_startup &&
-			path2->startup_cost > path1->startup_cost * fuzz_factor &&
-			path1->param_info == NULL)
+			path2->startup_cost > path1->startup_cost * fuzz_factor)
 		{
 			/* ... but path2 fuzzily worse on startup, so DIFFERENT */
 			return COSTS_DIFFERENT;
@@ -172,8 +171,7 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor,
 	{
 		/* path2 fuzzily worse on total cost */
 		if (consider_startup &&
-			path1->startup_cost > path2->startup_cost * fuzz_factor &&
-			path2->param_info == NULL)
+			path1->startup_cost > path2->startup_cost * fuzz_factor)
 		{
 			/* ... but path1 fuzzily worse on startup, so DIFFERENT */
 			return COSTS_DIFFERENT;
@@ -183,14 +181,12 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor,
 	}
 	/* fuzzily the same on total cost */
 	/* (so we may as well compare startup cost, even if !consider_startup) */
-	if (path1->startup_cost > path2->startup_cost * fuzz_factor &&
-		path2->param_info == NULL)
+	if (path1->startup_cost > path2->startup_cost * fuzz_factor)
 	{
 		/* ... but path1 fuzzily worse on startup, so path2 wins */
 		return COSTS_BETTER2;
 	}
-	if (path2->startup_cost > path1->startup_cost * fuzz_factor &&
-		path1->param_info == NULL)
+	if (path2->startup_cost > path1->startup_cost * fuzz_factor)
 	{
 		/* ... but path2 fuzzily worse on startup, so path1 wins */
 		return COSTS_BETTER1;
@@ -401,6 +397,9 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	ListCell   *p1_prev;
 	ListCell   *p1_next;
 
+	/* XXX force startup cost comparison */
+	parent_rel->consider_startup = true;
+
 	/*
 	 * This is a convenient place to check for query cancel --- no part of the
 	 * planner goes very long without calling add_path().

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] nested loop semijoin estimates

Reply via email to