[ 
https://issues.apache.org/jira/browse/CONNECTORS-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640315#comment-13640315
 ] 

Erlend GarĂ¥sen commented on CONNECTORS-678:
-------------------------------------------

Basically no. We haven't configured any login methods for these hosts. We only 
want to crawl their open pages.

foreninger.uio.no is the host for all the student organizations at the uni, and 
therefore, it may include a lot of odd things. Regarding session auth, possibly 
some PHP apps with such things, but we're not trying to access protected sites.

Regarding wiki.uio.no, I have filtered out the path to the SAML login page. I'm 
only trying to fetch the open pages from wiki.

{code}
uio_manifoldcf=> \d prereqevents
          Table "public.prereqevents"
  Column   |          Type          | Modifiers 
-----------+------------------------+-----------
 owner     | bigint                 | not null
 eventname | character varying(255) | not null
Indexes:
    "i1362584563127" btree (owner)
Foreign-key constraints:
    "prereqevents_owner_fkey" FOREIGN KEY (owner) REFERENCES jobqueue(id) ON 
DELETE RESTRICT
{code}

Using the today + 24 hours for checktime (I worked in the opposite direction in 
my previous post): 1366884777

{code}
uio_manifoldcf=> ANALYZE prereqevents;
ANALYZE
uio_manifoldcf=> EXPLAIN ANALYZE SELECT t0.id,t0.jobid,t0.dochash,t0.           
                                                                                
          docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue 
t0  WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND 
t0.checktime<=1366884777 AND             EXISTS(SELECT 'x' FROM jobs t1 WHERE 
t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND                
                                                     NOT EXISTS(SELECT 'x' FROM 
jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN 
('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND                           
      NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner 
AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 100;
                                                                         QUERY 
PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10560.94..10560.94 rows=1 width=155) (actual time=32.915..33.109 
rows=100 loops=1)
   ->  Sort  (cost=10560.94..10560.94 rows=1 width=155) (actual 
time=32.913..32.981 rows=100 loops=1)
         Sort Key: t0.docpriority
         Sort Method: top-N heapsort  Memory: 51kB
         ->  Nested Loop Anti Join  (cost=10536.77..10560.93 rows=1 width=155) 
(actual time=0.853..30.546 rows=1737 loops=1)
               Join Filter: (t2.jobid <> t0.jobid)
               ->  Nested Loop Semi Join  (cost=10536.77..10552.32 rows=1 
width=155) (actual time=0.834..14.502 rows=1737 loops=1)
                     ->  Hash Anti Join  (cost=10536.77..10550.89 rows=1 
width=155) (actual time=0.822..6.614 rows=1737 loops=1)
                           Hash Cond: (t0.id = t3.owner)
                           ->  Bitmap Heap Scan on jobqueue t0  
(cost=4.61..18.39 rows=7 width=155) (actual time=0.685..3.482 rows=1737 loops=1)
                                 Recheck Cond: ((status = ANY 
('{P,G}'::bpchar[])) AND (checkaction = 'R'::bpchar) AND (checktime <= 
1366884777))
                                 ->  Bitmap Index Scan on i1362584563123  
(cost=0.00..4.61 rows=7 width=0) (actual time=0.627..0.627 rows=1737 loops=1)
                                       Index Cond: ((status = ANY 
('{P,G}'::bpchar[])) AND (checkaction = 'R'::bpchar) AND (checktime <= 
1366884777))
                           ->  Hash  (cost=7636.98..7636.98 rows=231614 
width=8) (actual time=0.025..0.025 rows=0 loops=1)
                                 Buckets: 32768  Batches: 1  Memory Usage: 0kB
                                 ->  Hash Join  (cost=13.15..7636.98 
rows=231614 width=8) (actual time=0.023..0.023 rows=0 loops=1)
                                       Hash Cond: ((t3.eventname)::text = 
(t4.name)::text)
                                       ->  Seq Scan on prereqevents t3  
(cost=0.00..4439.14 rows=231614 width=37) (actual time=0.003..0.003 rows=1 
loops=1)
                                       ->  Hash  (cost=11.40..11.40 rows=140 
width=516) (actual time=0.002..0.002 rows=0 loops=1)
                                             Buckets: 1024  Batches: 1  Memory 
Usage: 0kB
                                             ->  Seq Scan on events t4  
(cost=0.00..11.40 rows=140 width=516) (actual time=0.001..0.001 rows=0 loops=1)
                     ->  Index Scan using jobs_pkey on jobs t1  
(cost=0.00..1.42 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1737)
                           Index Cond: (id = t0.jobid)
                           Filter: ((status = ANY ('{A,a}'::bpchar[])) AND 
(priority = 5))
               ->  Index Scan using i1362584563120 on jobqueue t2  
(cost=0.00..4.30 rows=1 width=49) (actual time=0.007..0.007 rows=0 loops=1737)
                     Index Cond: ((dochash)::text = (t0.dochash)::text)
                     Filter: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
 Total runtime: 33.275 ms
(28 rows)
{code}
                
> Postgresql generating terrible plans for stuffer queries
> --------------------------------------------------------
>
>                 Key: CONNECTORS-678
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-678
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Framework core
>    Affects Versions: ManifoldCF 1.1.1
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 1.2
>
>
> Query plans like this seem to be taking place:
> {code}
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan: Limit  
> (cost=9597.49..9597.49 rows=1 width=155)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:   ->  Sort  
> (cost=9597.49..9597.49 rows=1 width=155)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:         Sort Key: 
> t0.docpriority
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:         ->  Nested 
> Loop Anti Join  (cost=17.68..9597.48 rows=1 width=155)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:               Join 
> Filter: (t2.jobid <> t0.jobid)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:               ->  
> Nested Loop Semi Join  (cost=17.68..9588.87 rows=1 width=155)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                     
> Join Filter: (t0.jobid = t1.id)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                     
> ->  Nested Loop Anti Join  (cost=17.68..9586.81 rows=1 width=155)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>     Join Filter: (t0.id = t3.owner)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>     ->  Bitmap Heap Scan on jobqueue t0  (cost=4.53..6.54 rows=1 width=155)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>           Recheck Cond: (status = ANY ('{P,G}'::bpchar[]))
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>           Filter: ((checktime <= 1366628406182::bigint) AND (checkaction = 
> 'R'::bpchar))
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>           ->  Bitmap Index Scan on i1362584563122  (cost=0.00..4.53 rows=1 
> width=0)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>                 Index Cond: (status = ANY ('{P,G}'::bpchar[]))
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>     ->  Hash Join  (cost=13.15..7635.56 rows=231554 width=8)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>           Hash Cond: ((t3.eventname)::text = (t4.name)::text)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>           ->  Seq Scan on prereqevents t3  (cost=0.00..4438.54 rows=231554 
> width=37)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>           ->  Hash  (cost=11.40..11.40 rows=140 width=516)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>                 ->  Seq Scan on events t4  (cost=0.00..11.40 rows=140 
> width=516)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                     
> ->  Seq Scan on jobs t1  (cost=0.00..2.03 rows=2 width=8)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                       
>     Filter: ((status = ANY ('{A,a}'::bpchar[])) AND (priority = 5::bigint))
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:               ->  
> Index Scan using i1362584563120 on jobqueue t2  (cost=0.00..4.30 rows=1 
> width=49)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                     
> Index Cond: ((dochash)::text = (t0.dochash)::text)
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) -  Plan:                     
> Filter: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
>  WARN 2013-04-22 13:01:20,414 (Stuffer thread) - 
> {code}
> Here's the query:
> {code}
> EXPLAIN SELECT 
> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>  FROM jobqueue t0  WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND 
> t0.checktime<=1366628406182 AND
>     EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND 
> t1.id=t0.jobid AND t1.priority=5) AND
>     NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND 
> t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND
>     NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner 
> AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 100;
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to