Hi all, I'm using PostgreSQL 8.4.2 on an Ubuntu Linux machine.
I have several tables, one of which is named Document and represents information I need about my documents. I also have another, similar table called Doc2. Both tables share the following schema:

    CREATE TABLE "Document" (
      docid integer NOT NULL DEFAULT nextval('doc_id_seq'::regclass),
      hwdocid character varying(511) NOT NULL,
      pubdate bigint,
      finished boolean DEFAULT false,
      "location" character varying(200),
      title tsvector,
      description tsvector,
      "content" text,
      CONSTRAINT pk_docid PRIMARY KEY (docid),
      CONSTRAINT hwdocid_uniq UNIQUE (hwdocid)
    ) WITH (
      OIDS=FALSE
    );

In this case hwdocid is no longer than 12 characters; the maximum is 511 because the same schema is used by other applications.

What I wish to do is copy rows from Doc2 into Document, provided that the hwdocid from Doc2 is not already present in Document (as the entries will be similar). Doc2 contains ~100,000 rows while Document contains ~1,000,000. I wrote a simple query to do this:

    INSERT INTO "Document" ( hwdocid, pubdate, finished, "location", title, description, "content" )
    SELECT hwdocid, pubdate, finished, "location", title, description, "content"
    FROM "Doc2" d2
    WHERE d2.hwdocid NOT IN (
      SELECT d.hwdocid FROM "Document" d
    );

After it had been running for about half an hour in pgadmin3, I stopped the execution, since I realized what I was doing was pretty dumb: with every insert, Document would grow (and I know beforehand that the data in Doc2 contain unique hwdocid values). At first I thought each INSERT was creating a new transaction, which would explain why it was taking so long, so I decided to do something else: dump the documents I want into a temp table and then simply insert them into Document. Before that, however, I wanted to see how many documents I was actually trying to insert (as an indication of why it took so long). So I ran just the SELECT part for those documents.
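As an aside, the alternative I have in mind for the actual insert is an anti-join spelled as a LEFT JOIN, along these lines (just a sketch of the idea, not something I have run yet):

    INSERT INTO "Document" ( hwdocid, pubdate, finished, "location", title, description, "content" )
    SELECT d2.hwdocid, d2.pubdate, d2.finished, d2."location", d2.title, d2.description, d2."content"
    FROM "Doc2" d2
    LEFT JOIN "Document" d ON d.hwdocid = d2.hwdocid
    WHERE d.hwdocid IS NULL;  -- keep only Doc2 rows with no match in Document

Anyway, back to the SELECT I mentioned: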
    SELECT * FROM "Doc2" d2
    WHERE d2.hwdocid NOT IN (
      SELECT d.hwdocid FROM "Document" d
    );

I submitted the query and let it run. After 5 hours I stopped it and ran EXPLAIN ANALYZE on it instead; after ~10 minutes I stopped that too. So I rewrote the query as:

    SELECT hwdocid, pubdate, finished, "location", title, description, "content"
    FROM "Doc2" d2
    WHERE NOT EXISTS (
      SELECT d.hwdocid FROM "Document" d WHERE d.hwdocid = d2.hwdocid
    );

and asked for the explanation, which was:

    Hash Anti Join  (cost=72484.24..90988.89 rows=1 width=317) (actual time=3815.471..9063.184 rows=63836 loops=1)
      Hash Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
      ->  Seq Scan on "Doc2" d2  (cost=0.00..5142.54 rows=96454 width=317) (actual time=0.016..186.781 rows=96454 loops=1)
      ->  Hash  (cost=56435.22..56435.22 rows=949922 width=12) (actual time=3814.968..3814.968 rows=948336 loops=1)
            ->  Seq Scan on "Document" d  (cost=0.00..56435.22 rows=949922 width=12) (actual time=0.008..1926.191 rows=948336 loops=1)
    Total runtime: 9159.050 ms

I then ran the query normally and got a result back in ~5-6 seconds.

So my questions are:

1) Why does the first query (with NOT IN) take *so* long, even for the plain SELECT?

2) The two queries should return the same result. Since I wasn't even returned an explanation, could someone make a (wild) guess at what the NOT IN version is doing (or trying to do) that takes so long?

3) My intuition is that, since there is a unique constraint on hwdocid, which implies the existence of an index, that index would be used. Isn't that so? I mean, since it is a unique field, shouldn't the planner just do a sequential scan on Doc2 and probe the index for each value? What am I getting wrong?

Thank you very much in advance!

Regards,
George Valkanas
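P.S. While writing this up I noticed that NOT IN and NOT EXISTS are not equivalent in general: if the subquery produces any NULL, "x NOT IN (...)" can only evaluate to false or NULL, never true, so the outer query returns no rows at all. Here hwdocid is declared NOT NULL, so I assume this particular trap doesn't apply to my case, but a contrived sketch of the difference would be:

    SELECT 1 WHERE 'a' NOT IN (SELECT x FROM unnest(ARRAY['b', NULL]) AS u(x));
    -- 'a' = NULL is unknown, so NOT IN yields NULL: no rows

    SELECT 1 WHERE NOT EXISTS (SELECT 1 FROM unnest(ARRAY['b', NULL]) AS u(x) WHERE u.x = 'a');
    -- no matching row exists, so NOT EXISTS is true: one row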