On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma <adarsh.sha...@orkash.com>wrote:
> Dear all, > > I have 2 tables in my database name clause2( 4900 MB) & page_content(1582 > MB). > > My table definations are as : > > *page_content :- > > *CREATE TABLE page_content > ( > content_id integer, > wkb_geometry geometry, > link_level integer, > isprocessable integer, > isvalid integer, > isanalyzed integer, > islocked integer, > content_language character(10), > url_id integer, > publishing_date character(40), > heading character(150), > category character(150), > crawled_page_url character(500), > keywords character(500), > dt_stamp timestamp with time zone, > "content" character varying, > crawled_page_id bigint, > id integer > ) > WITH ( > OIDS=FALSE > ); > > *Indexes on it :-* > CREATE INDEX idx_page_id ON page_content USING btree (crawled_page_id); > CREATE INDEX idx_page_id_content ON page_content USING btree > (crawled_page_id, content_language, publishing_date, isprocessable); > CREATE INDEX pgweb_idx ON page_content USING gin > (to_tsvector('english'::regconfig, content::text)); > > *clause 2:- > *CREATE TABLE clause2 > ( > id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass), > source_id integer, > sentence_id integer, > clause_id integer, > tense character varying(30), > clause text, > CONSTRAINT pk_clause_demo_id PRIMARY KEY (id) > )WITH ( OIDS=FALSE); > > *Indexes on it : > > *CREATE INDEX idx_clause2_march10 > ON clause2 > USING btree > (id, source_id);* > > *I perform a join query on it as : > > * explain analyze select distinct(p.crawled_page_id) from page_content p , > clause2 c where p.crawled_page_id != c.source_id ; > > *What it takes more than 1 hour to complete. As I issue the explain > analyze command and cannot able to wait for output but I send my explain > output as : > QUERY > PLAN > > -------------------------------------------------------------------------------------------------------- > Unique (cost=927576.16..395122387390.13 rows=382659 width=8) > -> Nested Loop (cost=927576.16..360949839832.15 rows=13669019023195 > width=8) > Join Filter: (p.crawled_page_id <> c.source_id) > -> Index Scan using idx_page_id on page_content p > (cost=0.00..174214.02 rows=428817 width=8) > -> Materialize (cost=927576.16..1370855.12 rows=31876196 > width=4) > -> Seq Scan on clause2 c (cost=0.00..771182.96 > rows=31876196 width=4) > (6 rows) > > > Please guide me how to make the above query run faster as I am not able to > do that. > > > Thanks, Adarsh > > * > > * > Could you try just explaining the below query: explain select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id); The idea here is to avoid directly using NOT operator. Regards, Chetan -- Chetan Suttraway EnterpriseDB <http://www.enterprisedb.com/>, The Enterprise PostgreSQL<http://www.enterprisedb.com/> company.