Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

David Rowley Tue, 22 Sep 2015 22:12:16 -0700

On 20 August 2015 at 13:49, Tomas Vondra <[email protected]>
wrote:


> attached is a significantly reworked patch for using the foreign keys in
> selectivity estimation.
>

Thanks for working a new patch, I'm starting to look at it again now:

Here's my thoughts so far:

+ /*
+ * TODO Can we do something like (hasindex) here? Is it necessary? The
+ *      trouble with that is that we don't have a good place to reset that
+ *      flag (relhasindex is reset by vacuum, but is has nothing to do with
+ *      foreign keys at this point).
+ */
+ if (true)

You don't seem to have addressed this part yet.

I mentioned previously that I had attempted this already. Here was the
response
http://www.postgresql.org/message-id/[email protected]

Please just remove the comment and if (true).  By today's
standards relhasindex would never be added.
Does it not just serve as an optimization only when the rel is not cached
anyway?

--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -562,7 +562,6 @@ remove_rel_from_joinlist(List *joinlist, int relid, int
*nremoved)
  return result;
 }

-
 /*

Accidental change, please remove.


+static Selectivity
+clauselist_join_selectivity(PlannerInfo *root, List *joinquals, int varno,
+ JoinType jointype, SpecialJoinInfo *sjinfo)

varno is not used.

+ /*
+ * First we'll identify foreign keys that are fully matched by the join
+ * clauses, and we'll update the selectivity accordingly while doing so.
+ */
+ fkeys = find_satisfied_fkeys(root, sjinfo, joinquals, &sel);
+
+ /*
+ * Now that we have the foreign keys, we can get rid of the clauses
+ * matching any of them, and only keep the remaining clauses, so that
+ * we can estimate them using the regular selectivity estimation.
+ */
+ unmatched = filter_fk_join_clauses(root, sjinfo, fkeys, joinquals);

This seems a bit convoluted and inefficient.
Why not just return a bitmask of the matched items indexed by the list
position?
Doing it this way you're having to match the expressions twice. Once seems
fine.
Did you read my review about looking for the longest matches by counting
the bits in the mask?

+ * Another slightly strange case is FK constraints in both directions
(these
+ * statements don't work - the foreign keys need to be established using
+ * ALTER, but for illustration it's sufficient).
+ *
+ * CREATE TABLE a (
+ *     a1 INT,
+ *     a2 INT,
+ *     UNIQUE (a1, a2),
+ *     FOREIGN KEY (a1, a2) REFERENCES a (b1, b2)
+ * );

Did you perhaps mean b? i.e. REFERENCES b (b1, b2) ?

Same is duplicated further down.

+ *sel *= (1.0 / rel_outer->tuples);

I think this needs to be :

+ *sel *= (1.0 / Max(rel_outer->tuples, 1.0));

As the following causes a divide by zero.

See attached divide_by_zero.sql

Basically causes this:  Hash Join  (cost=8.29..-1.#J rows=1 width=40)


+ * XXX This does exactly the same thing as the previous loop, so no
+ *     comments.

It would be better if we could do something more how make_join_rel() does
things like:

add_paths_to_joinrel(root, joinrel, rel1, rel2,
JOIN_INNER, sjinfo,
restrictlist);
add_paths_to_joinrel(root, joinrel, rel2, rel1,
JOIN_INNER, sjinfo,
restrictlist);

Where the inputs to the function are just swapped.


+ outer = -1;
+ while ((outer = bms_next_member(sjinfo->min_lefthand, outer)) >= 0)

Why did you change this from ensuring we get a singleton member?
I'd imagine if more than 2 relations are required to make the join, then we
can't use foreign keys, as the other join may duplicate or eliminate tuples.
Perhaps you've thought of this more than I have. Would you be able to
explain?

+ * XXX Maybe we should estimate even the single-column keys here,
+ *     as it's really cheap. But it can't do any cross-table check
+ *     of MCV lists or whatever clauselist_selectivity() does.
+ */
+ if (fkinfo->nkeys < 2)
+ continue;

This should be removed. I see no reason at all to pass up likely perfect
estimates for histogram lookup estimates.


Overall, I really feel that you should just take the longest matching
foreign key. i.e the one that matches the most clauses in the joinquals,
and completely ignore any shorter matches, or partial matches.  Just that
alone is probably 99.9999% of the use cases that will benefit from this. No
use adding weird code that's only right half the time for the other 0.0001%
of use cases.

I imagine clauselist_join_selectivity should be more like:

int outerrelid, innerrelid;

if (bms_get_singleton_member(sjinfo->min_righthand, &innerrelid) &&
   bms_get_singleton_member(sjinfo->min_lefthand, &outerrelid))
{
   Bitmapset fkclauses1, fkclauses2;
   List *unmatched;
   Selectivity sel;
   RelOptInfo *innerrel = find_base_rel(root, innerrelid);
   RelOptInfo *outerrel = find_base_rel(root, outerrelid);

   fkclauses1 = find_foreign_key_clauses(root, outerrel, innerrel,
joinquals);
   fkclauses2 = find_foreign_key_clauses(root, innerrel, outerrel,
joinquals);

  if (fkclauses1 || fkclauses2)
  {
     if (bms_num_members(fkclauses1) <
         bms_num_members(fkclauses2))
      {
        bms_free(fkclauses1);
        fkclauses1 = fkclauses2;
        sel = 1.0 / Max(innerrel->tuples, 1.0);
      }
      else
      {
          bms_free(fkclauses2);
        sel = 1.0 / Max(outerrel->tuples, 1.0);
      }
  }

  if (fkclauses1)
  {
     ListCell *lc;
     int lst_idx = 0;
     unmatched = NIL;
     foreach (lc, joinquals)
     {
         Node *clause = (Node *) lfirst(lc);
         if (bms_is_member(fkclauses1, clause))
           unmatched = lappend(unmatched, clause);
     }
     return sel * clauselist_selectivity(root, unmatched,  0,
jointype, sjinfo);
  }
}
else
     return clauselist_selectivity(root, unmatched,  0,  jointype, sjinfo);

Of course, I'm not used to typing code in my email client. So likely it'll
need many fixups.

find_foreign_key_clauses() should look for the longest match and return a
Bitmap set of the list indexes to the caller.
It might be possible to fool the longest match logic by duplicating
clauses, e.g. a1 = b1 AND a1 = b1 and a1 = b1 AND a2 = b2 AND a3 = b3, but
I can't imagine that matters, but if it did, we could code it to be smart
enough to see through that.

One thing I've not thought of yet is the jointype, and also NULL values
references by the foreign key. Perhaps NULLs don't matter as they won't be
matched by the join condition anyway.

There's a few other things, but it makes sense to get the structure right
before going into finer details.

Let me know your thoughts.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
 PostgreSQL Development, 24x7 Support, Training & Services

drop table if exists a cascade;
drop table if exists b;

create table a (a1 int, a2 int, a3 int, a4 int not null, a5 int not null, 
primary key(a1,a2,a3), unique (a4,a5));
create table b (b1 int, b2 int, b3 int, b4 int not null, b5 int not null);

insert into a select a1,a2,a3,a1*a2,a2*a3 from generate_series(1,100) a1(a1), 
generate_series(1,100) a2(a2),generate_series(1,100) a3(a3);
insert into b select * from a;

analyze a;
analyze b;

select relname,reltuples from pg_class where relname in('a','b');

alter table b add constraint b_b1_b2_b3_fkey foreign key (b1,b2,b3) references 
a;
alter table b add constraint b_b4_b5_fkey foreign key (b4,b5) references 
a(a4,a5);

explain select * from a inner join b on a1=b1 and a2=b2 and a3=b3  and a3=b3 
and a4=b4 and a5=b5;
alter table b drop constraint b_b4_b5_fkey;
explain select * from a inner join b on a1=b1 and a2=b2 and a3=b3  and a3=b3 
and a4=b4 and a5=b5;

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

Reply via email to