Hi list!

I'm getting really surprising planner estimates for a query that's
joining another table via a varchar field. All of this was tested on
PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.

The original query is pretty huge, but I managed to shrink it down to this:
SELECT * FROM email_message where email_message.id_code IN (SELECT
id_code FROM client WHERE client_id='83509');

* id_code is an indexed varchar(20) NOT NULL column in both tables
* client_id is the primary key of client.
* There are 149152 rows in email_message and 140975 rows in client
* The most common value in both sides of the join is an empty string.
121970 in email_message and 10753 in client
(Turning the empty values into NULLs helps a little, but still gives
bad estimates)

This is the plan I get:
EXPLAIN SELECT * FROM email_message where email_message.id_code IN
(SELECT id_code FROM client WHERE client_id='83509');
 Nested Loop  (cost=8.28..36.86 rows=139542 width=101)
   ->  HashAggregate  (cost=8.28..8.29 rows=1 width=11)
         ->  Index Scan using client_pkey on client  (cost=0.00..8.28
rows=1 width=11)
               Index Cond: (client_id = 83509)
   ->  Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
         Index Cond: ((email_message.id_code)::text = (client.id_code)::text)
(6 rows)

This nestloop couldn't possibly generate 139542 rows since the inner
plan is expected to return 1 row and the outer plan 41

After a bit of digging, I figured out that it uses the same estimate
as a semi-join WITHOUT the client_id restriction.
EXPLAIN SELECT * FROM email_message m WHERE EXISTS(SELECT * FROM
client c WHERE m.id_code=c.id_code);
 Nested Loop Semi Join  (cost=0.00..7725.31 rows=139542 width=101)
   ->  Seq Scan on email_message m  (cost=0.00..3966.52 rows=149152 width=101)
   ->  Index Scan using client_id_code_idx1 on client c
(cost=0.00..0.39 rows=1 width=11)
         Index Cond: ((c.id_code)::text = (m.id_code)::text)

For whatever reason, the 1st query completely ignores the fact that
the client_id clause reduces the result count by a large factor.

So I turned this into a simple JOIN and I'm still seeing bad estimates:

EXPLAIN SELECT * FROM email_message JOIN client USING (id_code) WHERE
client_id='83509';
 Nested Loop  (cost=0.00..36.85 rows=9396 width=252)
   ->  Index Scan using client_pkey on client  (cost=0.00..8.28 rows=1
width=162)
         Index Cond: (client_id = 83509)
   ->  Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
         Index Cond: ((email_message.id_code)::text = (client.id_code)::text)

This is better, but still overestimates massively.

When I change empty values to NULLs, then this JOIN query starts
estimating correctly. So this one is probably confused because the
empty values would result in a cartesian join.

Are there any reasons why nestloop can't use the known (1 * 41) as its estimate?

Regards,
Marti Raudsepp
voicecom.ee

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to