Hi list! I'm getting really surprising planner estimates for a query that's joining another table via a varchar field. All of this was tested on PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.
The original query is pretty huge, but I managed to shrink it down to this: SELECT * FROM email_message where email_message.id_code IN (SELECT id_code FROM client WHERE client_id='83509'); * id_code is an indexed varchar(20) NOT NULL column in both tables * client_id is the primary key of client. * There are 149152 rows in email_message and 140975 rows in client * The most common value in both sides of the join is an empty string. 121970 in email_message and 10753 in client (Turning the empty values into NULLs helps a little, but still gives bad estimates) This is the plan I get: EXPLAIN SELECT * FROM email_message where email_message.id_code IN (SELECT id_code FROM client WHERE client_id='83509'); Nested Loop (cost=8.28..36.86 rows=139542 width=101) -> HashAggregate (cost=8.28..8.29 rows=1 width=11) -> Index Scan using client_pkey on client (cost=0.00..8.28 rows=1 width=11) Index Cond: (client_id = 83509) -> Index Scan using email_message_id_code_idx on email_message (cost=0.00..28.05 rows=41 width=101) Index Cond: ((email_message.id_code)::text = (client.id_code)::text) (6 rows) This nestloop couldn't possibly generate 139542 rows since the inner plan is expected to return 1 row and the outer plan 41 After a bit of digging, I figured out that it uses the same estimate as a semi-join WITHOUT the client_id restriction. EXPLAIN SELECT * FROM email_message m WHERE EXISTS(SELECT * FROM client c WHERE m.id_code=c.id_code); Nested Loop Semi Join (cost=0.00..7725.31 rows=139542 width=101) -> Seq Scan on email_message m (cost=0.00..3966.52 rows=149152 width=101) -> Index Scan using client_id_code_idx1 on client c (cost=0.00..0.39 rows=1 width=11) Index Cond: ((c.id_code)::text = (m.id_code)::text) For whatever reason, the 1st query completely ignores the fact that the client_id clause reduces the result count by a large factor. So I turned this into a simple JOIN and I'm still seeing bad estimates: EXPLAIN SELECT * FROM email_message JOIN client USING (id_code) WHERE client_id='83509'; Nested Loop (cost=0.00..36.85 rows=9396 width=252) -> Index Scan using client_pkey on client (cost=0.00..8.28 rows=1 width=162) Index Cond: (client_id = 83509) -> Index Scan using email_message_id_code_idx on email_message (cost=0.00..28.05 rows=41 width=101) Index Cond: ((email_message.id_code)::text = (client.id_code)::text) This is better, but still overestimates massively. When I change empty values to NULLs, then this JOIN query starts estimating correctly. So this one is probably confused because the empty values would result in a cartesian join. Are there any reasons why nestloop can't use the known (1 * 41) as its estimate? Regards, Marti Raudsepp voicecom.ee -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers