[
https://issues.apache.org/jira/browse/CONNECTORS-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640315#comment-13640315
]
Erlend GarĂ¥sen commented on CONNECTORS-678:
-------------------------------------------
Basically no. We haven't configured any login methods for these hosts. We only
want to crawl their open pages.
foreninger.uio.no is the host for all the student organizations at the uni, and
therefore, it may include a lot of odd things. Regarding session auth, possibly
some PHP apps with such things, but we're not trying to access protected sites.
Regarding wiki.uio.no, I have filtered out the path to the SAML login page. I'm
only trying to fetch the open pages from wiki.
{code}
uio_manifoldcf=> \d prereqevents
Table "public.prereqevents"
Column | Type | Modifiers
-----------+------------------------+-----------
owner | bigint | not null
eventname | character varying(255) | not null
Indexes:
"i1362584563127" btree (owner)
Foreign-key constraints:
"prereqevents_owner_fkey" FOREIGN KEY (owner) REFERENCES jobqueue(id) ON
DELETE RESTRICT
{code}
Using the today + 24 hours for checktime (I worked in the opposite direction in
my previous post): 1366884777
{code}
uio_manifoldcf=> ANALYZE prereqevents;
ANALYZE
uio_manifoldcf=> EXPLAIN ANALYZE SELECT t0.id,t0.jobid,t0.dochash,t0.
docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue
t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
t0.checktime<=1366884777 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND
NOT EXISTS(SELECT 'x' FROM
jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN
('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND
NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner
AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 100;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=10560.94..10560.94 rows=1 width=155) (actual time=32.915..33.109
rows=100 loops=1)
-> Sort (cost=10560.94..10560.94 rows=1 width=155) (actual
time=32.913..32.981 rows=100 loops=1)
Sort Key: t0.docpriority
Sort Method: top-N heapsort Memory: 51kB
-> Nested Loop Anti Join (cost=10536.77..10560.93 rows=1 width=155)
(actual time=0.853..30.546 rows=1737 loops=1)
Join Filter: (t2.jobid <> t0.jobid)
-> Nested Loop Semi Join (cost=10536.77..10552.32 rows=1
width=155) (actual time=0.834..14.502 rows=1737 loops=1)
-> Hash Anti Join (cost=10536.77..10550.89 rows=1
width=155) (actual time=0.822..6.614 rows=1737 loops=1)
Hash Cond: (t0.id = t3.owner)
-> Bitmap Heap Scan on jobqueue t0
(cost=4.61..18.39 rows=7 width=155) (actual time=0.685..3.482 rows=1737 loops=1)
Recheck Cond: ((status = ANY
('{P,G}'::bpchar[])) AND (checkaction = 'R'::bpchar) AND (checktime <=
1366884777))
-> Bitmap Index Scan on i1362584563123
(cost=0.00..4.61 rows=7 width=0) (actual time=0.627..0.627 rows=1737 loops=1)
Index Cond: ((status = ANY
('{P,G}'::bpchar[])) AND (checkaction = 'R'::bpchar) AND (checktime <=
1366884777))
-> Hash (cost=7636.98..7636.98 rows=231614
width=8) (actual time=0.025..0.025 rows=0 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 0kB
-> Hash Join (cost=13.15..7636.98
rows=231614 width=8) (actual time=0.023..0.023 rows=0 loops=1)
Hash Cond: ((t3.eventname)::text =
(t4.name)::text)
-> Seq Scan on prereqevents t3
(cost=0.00..4439.14 rows=231614 width=37) (actual time=0.003..0.003 rows=1
loops=1)
-> Hash (cost=11.40..11.40 rows=140
width=516) (actual time=0.002..0.002 rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory
Usage: 0kB
-> Seq Scan on events t4
(cost=0.00..11.40 rows=140 width=516) (actual time=0.001..0.001 rows=0 loops=1)
-> Index Scan using jobs_pkey on jobs t1
(cost=0.00..1.42 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1737)
Index Cond: (id = t0.jobid)
Filter: ((status = ANY ('{A,a}'::bpchar[])) AND
(priority = 5))
-> Index Scan using i1362584563120 on jobqueue t2
(cost=0.00..4.30 rows=1 width=49) (actual time=0.007..0.007 rows=0 loops=1737)
Index Cond: ((dochash)::text = (t0.dochash)::text)
Filter: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
Total runtime: 33.275 ms
(28 rows)
{code}
> Postgresql generating terrible plans for stuffer queries
> --------------------------------------------------------
>
> Key: CONNECTORS-678
> URL: https://issues.apache.org/jira/browse/CONNECTORS-678
> Project: ManifoldCF
> Issue Type: Bug
> Components: Framework core
> Affects Versions: ManifoldCF 1.1.1
> Reporter: Karl Wright
> Assignee: Karl Wright
> Priority: Critical
> Fix For: ManifoldCF 1.2
>
>
> Query plans like this seem to be taking place:
> {code}
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: Limit
> (cost=9597.49..9597.49 rows=1 width=155)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: -> Sort
> (cost=9597.49..9597.49 rows=1 width=155)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: Sort Key:
> t0.docpriority
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: -> Nested
> Loop Anti Join (cost=17.68..9597.48 rows=1 width=155)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: Join
> Filter: (t2.jobid <> t0.jobid)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: ->
> Nested Loop Semi Join (cost=17.68..9588.87 rows=1 width=155)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Join Filter: (t0.jobid = t1.id)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Nested Loop Anti Join (cost=17.68..9586.81 rows=1 width=155)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Join Filter: (t0.id = t3.owner)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Bitmap Heap Scan on jobqueue t0 (cost=4.53..6.54 rows=1 width=155)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Recheck Cond: (status = ANY ('{P,G}'::bpchar[]))
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Filter: ((checktime <= 1366628406182::bigint) AND (checkaction =
> 'R'::bpchar))
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Bitmap Index Scan on i1362584563122 (cost=0.00..4.53 rows=1
> width=0)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Index Cond: (status = ANY ('{P,G}'::bpchar[]))
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Hash Join (cost=13.15..7635.56 rows=231554 width=8)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Hash Cond: ((t3.eventname)::text = (t4.name)::text)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Seq Scan on prereqevents t3 (cost=0.00..4438.54 rows=231554
> width=37)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Hash (cost=11.40..11.40 rows=140 width=516)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Seq Scan on events t4 (cost=0.00..11.40 rows=140
> width=516)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> -> Seq Scan on jobs t1 (cost=0.00..2.03 rows=2 width=8)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Filter: ((status = ANY ('{A,a}'::bpchar[])) AND (priority = 5::bigint))
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan: ->
> Index Scan using i1362584563120 on jobqueue t2 (cost=0.00..4.30 rows=1
> width=49)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Index Cond: ((dochash)::text = (t0.dochash)::text)
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) - Plan:
> Filter: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
> WARN 2013-04-22 13:01:20,414 (Stuffer thread) -
> {code}
> Here's the query:
> {code}
> EXPLAIN SELECT
> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
> t0.checktime<=1366628406182 AND
> EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND
> t1.id=t0.jobid AND t1.priority=5) AND
> NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
> t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND
> NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner
> AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 100;
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira