Very Fast Version:
Recently my database stopped respecting one of my indexes, which took a query
that should run in "subsecond response time" and turning it into something that
with small data sets runs in the 7-10 minute range and with large data sets
runs in the 30 minute - eternity range.
Explain Analyze tells me that what used to be an Index Scan has become a Seq
Scan, doing a full table scan through 140 million records.
Any thoughts on why that happens?
+++++++++++++++++++
Longer version:
I have a very large database (billion+ records) that has begun not respecting
indexes on some queries, resulting in what used to be "instant answers" now
taking many minutes and I'm trying to figure out why.
I'll try to simplify the request this way:
I have a database of many hundreds of millions of email messages, and
information about those messages. There is a table that tells me when I
received emails and from what source. There is another table that tells what
URLs were seen in the bodies of those emails and there is a table that links
those two tables together. Since many emails can contain the same URL, and
many URLs can be seen in each email, we divide like this:
Tables
===========
emails
link_urls
email_links
email.message_id = link_url.message_id
linkurl.urlid = email_links.urlid
One attribute of the URL is the "hostname" portion.
In my puzzle, I have a table of hostnames, and I want to get statistics about
which emails contained URLs that pointed to those hostnames.
A very simplified version of the query could be:
select email.stuff from email natural join link_url natural join email_links
where hostname = 'foo.bar.com';
We know that there were two emails that advertised foo.bar.com, so the
objective is to link the presence of foo.bar.com to a URLID, use the link_url
table to find the respective message_id and then go ask the email table for
details about those two messages.
My "explain analyze" of that query takes seven minutes to complete, despite the
fact that I have an index on message_id, and urlid, and machine where they
occur in each of the three tables.
We broke the query down into it's components, and confirmed that what we
consider the three natural pieces of the query each run "blazing fast", each
using the proper indexes. But when we put them all together, the index gets
ignored, and we sequential scan a 140 million row table.
Which is slow.
===============================
If we break it into pieces ... a human might think of the pieces as:
============
explain analyze select urlid from email_links where machine = 'foo.bar.com';
QUERY PLAN
-------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_links (cost=97.10..9650.58 rows=2457 width=33)
(actual time=0.049..0.049 rows=1 loops=1)
Recheck Cond: (machine = 'foo.bar.com'::text)
-> Bitmap Index Scan on hostdex (cost=0.00..96.49 rows=2457 width=0)
(actual time=0.039..0.039 rows=1 loops=1)
Index Cond: (machine = 'foo.bar.com'::text)
Total runtime: 0.066 ms
(5 rows)
=============
=> explain analyze select message_id from link_url where urlid =
'9de6440fcc089c654806bd4c853c76f1';
QUERY PLAN
-------------------------------------------------------------------------------------------
Bitmap Heap Scan on link_url (cost=65.20..5680.14 rows=1437 width=4) (actual
time=0.074..0.083 rows=2 loops=1)
Recheck Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
-> Bitmap Index Scan on link_url_hkey (cost=0.00..64.84 rows=1437 width=0)
(actual time=0.021..0.021 rows=2 loops=1)
Index Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
Total runtime: 0.109 ms
(5 rows)
=================
explain analyze select stuff from email where message_id in (78085350,
78088168);
QUERY PLAN
-------------------------------------------------------------------------------------------
Bitmap Heap Scan on email (cost=17.19..25.21 rows=2 width=8) (actual
time=0.068..0.077 rows=2 loops=1)
Recheck Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
-> Bitmap Index Scan on email_pkey (cost=0.00..17.19 rows=2 width=0)
(actual time=0.054..0.054 rows=2 loops=1)
Index Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
Total runtime: 0.100 ms
(5 rows)
++++++++++++++++++++++
In my simple mind that should take (.1 + .109 + .066) = FAST.
Here's the "Explain Analyze" for the combined version:
++++++++++++++++++++++++
explain analyze select email.stuff from email natural join link_url natural
join email_link where machine = 'foo.bar.com';
QUERY PLAN
-------------------------------------------------------------------------------------------
Merge Join (cost=3949462.38..8811048.82 rows=4122698 width=7) (actual
time=771578.076..777749.755 rows=3 loops=1)
Merge Cond: (email.message_id = link_url.message_id)
-> Index Scan using email_pkey on email (cost=0.00..4561330.19
rows=79154951 width=11) (actual time=0.041..540883.445 rows=79078427 loops=1)
-> Materialize (cost=3948986.49..4000520.21 rows=4122698 width=4) (actual
time=227023.820..227023.823 rows=3 loops=1)
-> Sort (cost=3948986.49..3959293.23 rows=4122698 width=4) (actual
time=227023.816..227023.819 rows=3 loops=1)
Sort Key: link_url.message_id
Sort Method: quicksort Memory: 25kB
-> Hash Join (cost=9681.33..3326899.30 rows=4122698 width=4)
(actual time=216443.617..227023.798 rows=3 loops=1)
Hash Cond: (link_url.urlid = email_link.urlid)
-> Seq Scan on link_url (cost=0.00..2574335.33
rows=140331133 width=37) (actual time=0.013..207980.261 rows=140330592 lo
ops=1)
-> Hash (cost=9650.62..9650.62 rows=2457 width=33)
(actual time=0.074..0.074 rows=1 loops=1)
-> Bitmap Heap Scan on email_link
(cost=97.10..9650.62 rows=2457 width=33) (actual time=0.072..0.072 rows=1
loops=1
)
Recheck Cond: (hostname = 'foo.bar.com'::text)
-> Bitmap Index Scan on hostdex
(cost=0.00..96.49 rows=2457 width=0) (actual time=0.060..0.060 rows=1 loops=1
)
Index Cond: (hostname =
'foo.bar.com'::text)
Total runtime: 777749.820 ms
(16 rows)
++++++++++++++++++
See that "Seq Scan on link_url"? We can't figure out why that is there! We
should be scanning for a matching "urlid" and we have an index on "urlid"?
When this is happening in a "two table" version of this problem, we can get
temporary relief by giving the statement:
set enable_seqscan = false;
But in the "three table" version (which is itself a simplification of the real
problem, which is that instead of looking for 'foo.bar.com', I'm looking for
all the hostnames in the table "testurls") the "enable_seqscan = false" doesn't
seem to do anything.
++++++++++++++++++
Does anyone have suggestions as to why my database has started ignoring this
index in certain circumstances? We've tried a variety of different joins and
subselects and can't seem to beat the "combined query" index failure.
The work-around has been to actually write code that makes the query for
urlids, and then manually does the query into the table for message_ids, and
then manually does the query for email.stuff. It's bad fast, but we can't
figure out why 'the old way' quit working.
Suggestions welcome!
--
----------------------------------------------------------
Gary Warner
-----------------------------------------------------------
--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance