Very Fast Version:

Recently my database stopped respecting one of my indexes, which took a query 
that should run in "subsecond response time" and turning it into something that 
with small data sets runs in the 7-10 minute range and with large data sets 
runs in the 30 minute - eternity range.

Explain Analyze tells me that what used to be an Index Scan has become a Seq 
Scan, doing a full table scan through 140 million records.

Any thoughts on why that happens?

+++++++++++++++++++
Longer version:

I have a very large database (billion+ records) that has begun not respecting 
indexes on some queries, resulting in what used to be "instant answers" now 
taking many minutes and I'm trying to figure out why.

I'll try to simplify the request this way:

I have a database of many hundreds of millions of email messages, and 
information about those messages.  There is a table that tells me when I 
received emails and from what source.  There is another table that tells what 
URLs were seen in the bodies of those emails and there is a table that links 
those two tables together.  Since many emails can contain the same URL, and 
many URLs can be seen in each email, we divide like this:

Tables
===========
emails

link_urls

email_links

email.message_id     = link_url.message_id
linkurl.urlid        = email_links.urlid


One attribute of the URL is the "hostname" portion.

In my puzzle, I have a table of hostnames, and I want to get statistics about 
which emails contained URLs that pointed to those hostnames.

A very simplified version of the query could be:

select email.stuff from email natural join link_url natural join email_links 
where hostname = 'foo.bar.com';


We know that there were two emails that advertised foo.bar.com, so the 
objective is to link the presence of foo.bar.com to a URLID, use the link_url 
table to find the respective message_id and then go ask the email table for 
details about those two messages.


My "explain analyze" of that query takes seven minutes to complete, despite the 
fact that I have an index on message_id, and urlid, and machine where they 
occur in each of the three tables.

We broke the query down into it's components, and confirmed that what we 
consider the three natural pieces of the query each run "blazing fast", each 
using the proper indexes.  But when we put them all together, the index gets 
ignored, and we sequential scan a 140 million row table.

Which is slow.


===============================
If we break it into pieces ... a human might think of the pieces as:


============

explain analyze select urlid from email_links where machine = 'foo.bar.com';

                               QUERY PLAN                  
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email_links  (cost=97.10..9650.58 rows=2457 width=33) 
(actual time=0.049..0.049 rows=1 loops=1)
   Recheck Cond: (machine = 'foo.bar.com'::text)
   ->  Bitmap Index Scan on hostdex  (cost=0.00..96.49 rows=2457 width=0) 
(actual time=0.039..0.039 rows=1 loops=1)
         Index Cond: (machine = 'foo.bar.com'::text)
 Total runtime: 0.066 ms
(5 rows)


=============

=> explain analyze select message_id from link_url where urlid = 
'9de6440fcc089c654806bd4c853c76f1';
  QUERY PLAN                                                        
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on link_url  (cost=65.20..5680.14 rows=1437 width=4) (actual 
time=0.074..0.083 rows=2 loops=1)
   Recheck Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
   ->  Bitmap Index Scan on link_url_hkey  (cost=0.00..64.84 rows=1437 width=0) 
(actual time=0.021..0.021 rows=2 loops=1)
         Index Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
 Total runtime: 0.109 ms
(5 rows)

=================

explain analyze select stuff from email where message_id in (78085350, 
78088168);
 QUERY PLAN                                                     
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email  (cost=17.19..25.21 rows=2 width=8) (actual 
time=0.068..0.077 rows=2 loops=1)
   Recheck Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
   ->  Bitmap Index Scan on email_pkey  (cost=0.00..17.19 rows=2 width=0) 
(actual time=0.054..0.054 rows=2 loops=1)
         Index Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
 Total runtime: 0.100 ms
(5 rows)

++++++++++++++++++++++


In my simple mind that should take (.1 + .109 + .066) = FAST.

Here's the "Explain Analyze" for the combined version:

++++++++++++++++++++++++
explain analyze select email.stuff from email natural join link_url natural 
join email_link where machine = 'foo.bar.com';

 QUERY PLAN                                                               
-------------------------------------------------------------------------------------------
 Merge Join  (cost=3949462.38..8811048.82 rows=4122698 width=7) (actual 
time=771578.076..777749.755 rows=3 loops=1)
   Merge Cond: (email.message_id = link_url.message_id)
   ->  Index Scan using email_pkey on email  (cost=0.00..4561330.19 
rows=79154951 width=11) (actual time=0.041..540883.445 rows=79078427 loops=1)
   ->  Materialize  (cost=3948986.49..4000520.21 rows=4122698 width=4) (actual 
time=227023.820..227023.823 rows=3 loops=1)
         ->  Sort  (cost=3948986.49..3959293.23 rows=4122698 width=4) (actual 
time=227023.816..227023.819 rows=3 loops=1)
               Sort Key: link_url.message_id
               Sort Method:  quicksort  Memory: 25kB
               ->  Hash Join  (cost=9681.33..3326899.30 rows=4122698 width=4) 
(actual time=216443.617..227023.798 rows=3 loops=1)
                     Hash Cond: (link_url.urlid = email_link.urlid)
                     ->  Seq Scan on link_url  (cost=0.00..2574335.33 
rows=140331133 width=37) (actual time=0.013..207980.261 rows=140330592 lo
ops=1)
                     ->  Hash  (cost=9650.62..9650.62 rows=2457 width=33) 
(actual time=0.074..0.074 rows=1 loops=1)
                           ->  Bitmap Heap Scan on email_link  
(cost=97.10..9650.62 rows=2457 width=33) (actual time=0.072..0.072 rows=1 
loops=1
)
                                 Recheck Cond: (hostname = 'foo.bar.com'::text)
                                 ->  Bitmap Index Scan on hostdex  
(cost=0.00..96.49 rows=2457 width=0) (actual time=0.060..0.060 rows=1 loops=1
)
                                       Index Cond: (hostname = 
'foo.bar.com'::text)
 Total runtime: 777749.820 ms
(16 rows)

++++++++++++++++++

See that "Seq Scan on link_url"?  We can't figure out why that is there!  We 
should be scanning for a matching "urlid" and we have an index on "urlid"?

When this is happening in a "two table" version of this problem, we can get 
temporary relief by giving the statement:

set enable_seqscan = false;

But in the "three table" version (which is itself a simplification of the real 
problem, which is that instead of looking for 'foo.bar.com', I'm looking for 
all the hostnames in the table "testurls") the "enable_seqscan = false" doesn't 
seem to do anything.


++++++++++++++++++

Does anyone have suggestions as to why my database has started ignoring this 
index in certain circumstances?   We've tried a variety of different joins and 
subselects and can't seem to beat the "combined query" index failure.

The work-around has been to actually write code that makes the query for 
urlids, and then manually does the query into the table for message_ids, and 
then manually does the query for email.stuff.  It's bad fast, but we can't 
figure out why 'the old way' quit working.

Suggestions welcome!




--

----------------------------------------------------------

Gary Warner

-----------------------------------------------------------


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to