Re: [PERFORM] Reason of Slowness of query

Adarsh Sharma Wed, 23 Mar 2011 00:26:31 -0700

Thanks Chetan, here is the output of your updated query :

*explain select distinct(p.crawled_page_id) from page_content p whereNOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);

QUERYPLAN---------------------------------------------------------------------------------------

HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8)
  ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 width=8)
        Hash Cond: (p.crawled_page_id = c.source_id)

-> Seq Scan on page_content p (cost=0.00..87132.17rows=428817 width=8)

        ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4)

-> Seq Scan on clause2 c (cost=0.00..771182.96rows=31876196 width=4)

(6 rows)

And my explain analyze output is :

QUERYPLAN--------------------------------------------------------------------------------------------------------------------------------------------HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8)(actual time=56666.181..56669.270 rows=72 loops=1)-> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071 width=8)(actual time=45740.789..56665.816 rows=74 loops=1)

        Hash Cond: (p.crawled_page_id = c.source_id)

-> Seq Scan on page_content p (cost=0.00..87132.17rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)-> Hash (cost=771182.96..771182.96 rows=31876196 width=4)(actual time=45310.524..45310.524 rows=31853083 loops=1)-> Seq Scan on clause2 c (cost=0.00..771182.96rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083 loops=1)

Total runtime: 56687.660 ms
(7 rows)

But Is there is any option to tune it further and one more thing outputrows varies from 6 to 7.



Thanks & best Regards,
Adarsh Sharma






Chetan Suttraway wrote:

On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma<[email protected] <mailto:[email protected]>> wrote:


    Dear all,

    I have 2 tables in my database name clause2( 4900 MB) &
    page_content(1582 MB).

    My table definations are as :

    *page_content :-

    *CREATE TABLE page_content
    (
      content_id integer,
      wkb_geometry geometry,
      link_level integer,
      isprocessable integer,
      isvalid integer,
      isanalyzed integer,
      islocked integer,
      content_language character(10),
      url_id integer,
      publishing_date character(40),
      heading character(150),
      category character(150),
      crawled_page_url character(500),
      keywords character(500),
      dt_stamp timestamp with time zone,
      "content" character varying,
      crawled_page_id bigint,
      id integer
    )
    WITH (
      OIDS=FALSE
    );

    *Indexes on it :-*

CREATE INDEX idx_page_id ON page_content USING btree(crawled_page_id);CREATE INDEX idx_page_id_content ON page_content USING btree(crawled_page_id, content_language, publishing_date, isprocessable);CREATE INDEX pgweb_idx ON page_content USING gin(to_tsvector('english'::regconfig, content::text));


    *clause 2:-
    *CREATE TABLE clause2
    (
      id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
      source_id integer,
      sentence_id integer,
      clause_id integer,
      tense character varying(30),
      clause text,
      CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
    )WITH ( OIDS=FALSE);

    *Indexes on it :

    *CREATE INDEX idx_clause2_march10
      ON clause2
      USING btree
      (id, source_id);*

    *I perform a join query on it as :

    * explain analyze select distinct(p.crawled_page_id) from
    page_content p , clause2  c where p.crawled_page_id != c.source_id ;

    *What it takes more than 1 hour to complete. As I issue the
    explain analyze command and cannot able to wait for output but I
    send my explain output as :
                                                 QUERY

PLAN--------------------------------------------------------------------------------------------------------

     Unique  (cost=927576.16..395122387390.13 rows=382659 width=8)
       ->  Nested Loop  (cost=927576.16..360949839832.15
    rows=13669019023195 width=8)
             Join Filter: (p.crawled_page_id <> c.source_id)

-> Index Scan using idx_page_id on page_content p(cost=0.00..174214.02 rows=428817 width=8)

             ->  Materialize  (cost=927576.16..1370855.12
    rows=31876196 width=4)
                   ->  Seq Scan on clause2 c  (cost=0.00..771182.96
    rows=31876196 width=4)
    (6 rows)


    Please guide me how to make the above query run faster as I am not
    able to do that.


    Thanks, Adarsh

    *

    *


Could you try just explaining the below query:

explain select distinct(p.crawled_page_id) from page_content p whereNOT EXISTS (select 1 from clause2 c where c.source_id =p.crawled_page_id);


The idea here is to avoid directly using NOT operator.



Regards,
Chetan

--
Chetan Suttraway

EnterpriseDB <http://www.enterprisedb.com/>, The Enterprise PostgreSQL<http://www.enterprisedb.com/> company.

Re: [PERFORM] Reason of Slowness of query

Reply via email to