Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

Tomas Vondra Mon, 28 Sep 2015 06:00:19 -0700

Hi,

On 09/27/2015 02:00 PM, David Rowley wrote:

I've been working on this again. I've put back the code that you wrote
for the looping over each combination of relations from either side of
the join.


I've also added some code to get around the problem with eclass joins
and the RestrictInfo having some alternative Vars that don't belong to
the foreign key. Basically I'm just checking if the RestrictInfo has a
parent_ec, and if it does just loop over the members to try and find the
Vars that belong to the foreign key. I've tested it with the following,
and it seems to work:

I didn't have time to look into the code yet, but this seems like aninteresting idea.


create table a as select i as a_id1, i as a_id2, i as dummy1 from
generate_series(0,999) s(i);
alter table a add unique (a_id1, a_id2);
create table b as select i as b_id1, i as b_id2 from
generate_series(0,332) s(i);

analyze a;
analyze b;

alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);

explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;

                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------
  Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual
time=0.775..1.046 rows=333 loops=1)
    Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
    ->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual
time=0.013..0.046 rows=333 loops=1)
    ->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual
time=0.737..0.737 rows=1000 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 51kB
          ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual
time=0.014..0.389 rows=1000 loops=1)
                Filter: (dummy1 = a_id1)

The non-patched version estimates 1 row. The patched estimates 2 rows,
but that's due to the bad estimate on dummy1 = a_id1.

The 2 comes from ceil(5 * 0.333).

Perhaps you have a better test case to for this?

I think the additional WHERE clause is needlessly confusing. I've beenable to come up with an example - pretty much a normalized with a "main"table and auxiliary tables (referencing the main one using FK) withadditional info. So not unlikely to happen in practice (except maybe forthe multi-column foreign key bit).



CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));

CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCESf(id1, id2));CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCESf(id1, id2));


INSERT INTO f SELECT i, i FROM generate_series(1,1000000) s(i);

INSERT INTO d1 SELECT i, i FROM generate_series(1,100000) s(i);
INSERT INTO d2 SELECT i, i FROM generate_series(1,300000) s(i);

now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimatedperfectly accurately, but as soon as the query involves both of them,this happens:


SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
                JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);

                          QUERY PLAN
-------------------------------------------------------------------------
 Nested Loop  (cost=3334.43..12647.57 rows=30000 width=24)
              (actual time=221.086..1767.206 rows=100000 loops=1)
   Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
   ->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
                  (actual time=221.058..939.482 rows=100000 loops=1)
         Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
         ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8)
                     (actual time=0.038..263.356 rows=300000 loops=1)
         ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
                   (actual time=220.721..220.721 rows=100000 loops=1)
               Buckets: 131072  Batches: 2  Memory Usage: 2982kB
               ->  Seq Scan on d1  (cost=0.00..1443.00 rows=100000 ...)
                       (actual time=0.033..101.547 rows=100000 loops=1)
   ->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
                        (actual time=0.004..0.004 rows=1 loops=100000)
         Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
         Heap Fetches: 100000

Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 100000).I assume that's only because we find FK only on the second join with f.

So it seems like s a clear improvement, both compared to master and theprevious versions of the patch.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

Reply via email to