Any explanation for this is appreciated...

I had a query which just seemed to peg the processor. I decided to let it go 
all night, and after 16 hours, it was still stuck at 100% processor usage. I 
figured something must be wrong, so I tried a few things. Using the following 
logic, I was able to get the results I was looking for in about 1 minute 
(instead of 16+ hours of no result). It all came down to getting around the use 
of "NOT IN (subquery)" where "subquery" has 20,000,000 rows. "IN (subquery)" 
was quick as could be. Here's what I found. 

Tables:
------------------------------
|       a     |      b       |
--------------|---------------
|guid vc(32)  | guid vc(32)  |
|unitid vc(15)| unitid vc(15)|
|ftime ts     |              |
|source vc(10)|              |
------------------------------
|65,000,000 rw| 500,000 rows |
------------------------------

Here is the query that hung up for 16 hours:

A:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid NOT IN (SELECT DISTINCT guid FROM a) AND
 unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid 
ORDER BY total DESC;

I thought I'd narrow down a.guid and a.unitid and substitute these for the 
subqueries.
B: SELECT DISTINCT unitid INTO a_u_unitid FROM a; (     1,082 rows)
C: SELECT DISTINCT guid   INTO a_u_guid   FROM a; (20,000,000 rows)

That didn't help on its own. So I decided to try the query without the "NOT". 
It finished in a minute or two. Granted, the result is the opposite of what I 
wanted, but at least I knew the major source of the problem. I still had the 
substituted queries B & C in, so it looked like this:

D:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid IN (SELECT guid FROM a_u_guid) AND
 unitid IN (SELECT unitid FROM a_u_unitid) 
GROUP BY unitid 
ORDER BY total DESC;

Next, I decided to just get my "NOT" set of guid's from b itself. If my logic 
isn't correct here, someone let me know. I did this, and got results very 
quickly (a minute perhaps):

E:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT guid FROM 
a_u_guid)) AND
 unitid IN (SELECT unitid FROM a_u_unitid) 
GROUP BY unitid 
ORDER BY total DESC;

Keep in mind a has 65,000,000 rows and b only has 500,000. The above query (E) 
gave me the original results I was looking for (E gives the expected results 
from A, if A had ever finished, assuming my logic is correct). I then decided 
to get rid of the intermediate tables B & C, so I was back to:

F:
SELECT unitid,count(guid) AS total FROM b 
WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT DISTINCT 
guid FROM a) AND
 Unitid IN (SELECT DISTINCT unitid FROM a)
GROUP BY unitid 
ORDER BY total DESC;

While F was running, I started typing this email. It still has not finished, 
but I have a feeling it will within another half hour or so. What I am taking 
away from this is that the use of "NOT IN (subquery)" where subquery has a 
large number of results will literally shut down performance and never give a 
result. Further, I am taking away that creating separate tables with just the 
DISTINCT values from a table with millions of rows also saves time. The problem 
here being that the separate table is not auto-updating when values in the 
master table change.

Query E then is apparently the way to go, but shouldn't there be a way to get 
the query planner to take these steps itself? If A had ever finished, I'd sure 
like to have seen an EXPLAIN ANALYZE on it. If F does finish, I made run an 
EXPLAIN ANALYZE on it versus E (I kind of expected F to finish while writing 
this email).

Thanks,
Jon

-- 
Sent via pgsql-admin mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Reply via email to