This is an artificial test case shrunk down from a much larger real query.

CREATE TABLE test1 ( sha1 bytea PRIMARY KEY, something text );
CREATE TABLE test2 ( sha1 bytea PRIMARY KEY, blah text );

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"
done

for i in $(seq 500 1500); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"
done

(Doesn't really matter whether the key values are the same or overlapping or not.  Just to make it interesting.)

ANALYZE;

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2);

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
         ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

That's OK.  Apparently it can tell that joining two tables on their primary keys cannot result in more rows than the smaller table.  (Or can it?)

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2 LIMIT 200);

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..33.35 rows=500 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the LIMIT.

The problem is that in the real query, instead of the 500 estimated above, it estimates about 30 million (which might be a reasonable estimate for a join between two unkeyed columns), when it should in fact be 200.  (And again, this is part of a larger query, which is then completely messed up because of this misestimation.)
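Incidentally, the per-row psql loops above can be replaced by a single INSERT per table.  This is just a sketch, assuming the pgcrypto extension is available for digest(); note that echo appends a newline, which is reproduced here so the hashes match the shell version:

```sql
-- Sketch: same data as the shell loops, in one statement per table.
-- Assumes pgcrypto for digest(); the trailing E'\n' mimics echo's newline.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

INSERT INTO test1
SELECT digest(i::text || E'\n', 'sha1'), 'blah' || i || i
FROM generate_series(1, 1000) AS s(i);

INSERT INTO test2
SELECT digest(i::text || E'\n', 'sha1'), 'foo' || i
FROM generate_series(500, 1500) AS s(i);
```

This avoids forking one psql process per row, which matters if you scale the test case back up toward the size of the real query.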
So what's up with that?  Is this just a case of "we haven't thought about covering this case yet", or are there larger problems?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers