This is an artificial test case shrunk down from a much larger real query.

CREATE TABLE test1 ( sha1 bytea PRIMARY KEY, something text );
CREATE TABLE test2 ( sha1 bytea PRIMARY KEY, blah text );

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"
done

for i in $(seq 500 1500); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"
done

(Doesn't really matter whether the key values are the same or overlapping or not.  Just to make it interesting.)

ANALYZE;

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2);

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
         ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

That's OK.  Apparently it can tell that joining two tables on their primary keys cannot result in more rows than the smaller table.  (Or can it?)

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2 LIMIT 200);

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..33.35 rows=500 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the LIMIT.

The problem is that in the real query, instead of the 500 estimated above, it estimates about 30 million (which might be a reasonable estimate for a join between two unkeyed columns), when it should in fact be 200.  (And again, this is part of a larger query, which is then completely messed up because of this misestimation.)
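Incidentally, the per-row psql loops above can be replaced by a single INSERT per table.  This is just a sketch, assuming the pgcrypto extension is available for digest(); note that echo appends a newline, which is reproduced here so the hashes match the shell version:

```sql
-- Sketch: same data as the shell loops, in one statement per table.
-- Assumes pgcrypto for digest(); the trailing E'\n' mimics echo's newline.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

INSERT INTO test1
SELECT digest(i::text || E'\n', 'sha1'), 'blah' || i || i
FROM generate_series(1, 1000) AS s(i);

INSERT INTO test2
SELECT digest(i::text || E'\n', 'sha1'), 'foo' || i
FROM generate_series(500, 1500) AS s(i);
```

This avoids forking one psql process per row, which matters if you scale the test case back up toward the size of the real query.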
So what's up with that?  Is this just a case of "we haven't thought about covering this case yet", or are there larger problems?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers