Hi Karl,
I do have that change. Paul Boichat, who was involved in the conversation
leading up to CONNECTORS-1027, applied the fix and tested it in our
environment.
I see your point that query #1 still has the IN (...) clause, indicating
that it hasn't benefited from the CONNECTORS-1027 fix. However, in my
execution plan that part of the query is never executed anyway - see below
for details. In addition, query #2 is clean, yet it still hits the database
hard.
======================
Execution Plan for Query #1:
EXPLAIN ANALYZE SELECT
t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0
WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R' AND t0.checktime<=1414683318802
  AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid AND t1.priority=5)
  AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid)
  AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
ORDER BY t0.docpriority ASC LIMIT 280;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..85910.03 rows=280 width=178) (actual time=15152.258..15152.258 rows=0 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..3187261.91 rows=10388 width=178) (actual time=15152.255..15152.255 rows=0 loops=1)
        Join Filter: ((t2.jobid <> t0.jobid) AND ((t2.dochash)::text = (t0.dochash)::text))
        ->  Nested Loop Anti Join  (cost=0.00..3186114.69 rows=10388 width=178) (actual time=15152.253..15152.253 rows=0 loops=1)
              ->  Nested Loop Semi Join  (cost=0.00..3177053.22 rows=10388 width=178) (actual time=15152.252..15152.252 rows=0 loops=1)
                    ->  Index Scan using i1392985450172 on jobqueue t0  (cost=0.00..3141756.78 rows=124662 width=178) (actual time=686.159..4046.926 rows=3346768 loops=1)
                          Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414683318802::bigint))
                          Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
                    ->  Index Scan using jobs_pkey on jobs t1  (cost=0.00..0.28 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=3346768)
                          Index Cond: (id = t0.jobid)
                          Filter: ((priority = 5) AND ((status = 'A'::bpchar) OR (status = 'a'::bpchar)))
                          Rows Removed by Filter: 1
              ->  Nested Loop  (cost=0.00..0.86 rows=1 width=8) (never executed)
                    ->  Index Scan using i1392985450178 on prereqevents t3  (cost=0.00..0.27 rows=1 width=524) (never executed)
                          Index Cond: (t0.id = owner)
                    ->  Index Only Scan using events_pkey on events t4  (cost=0.00..0.58 rows=1 width=516) (never executed)
                          Index Cond: (name = (t3.eventname)::text)
                          Heap Fetches: 0
        ->  Materialize  (cost=0.00..56.48 rows=6 width=49) (never executed)
              ->  Index Scan using i1392985450173 on jobqueue t2  (cost=0.00..56.45 rows=6 width=49) (never executed)
                    Index Cond: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
Total runtime: 15152.487 ms
Execution Plan for Query #2:
EXPLAIN ANALYZE SELECT docpriority,jobid,dochash,docid
FROM jobqueue t0
WHERE (status='P' OR status='G') AND checkaction='R' AND checktime<=1414685283327
  AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid)
ORDER BY docpriority ASC LIMIT 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..304.78 rows=1 width=144) (actual time=20620.458..20620.458 rows=0 loops=1)
  ->  Nested Loop Semi Join  (cost=0.00..3029548.13 rows=9940 width=144) (actual time=20620.456..20620.456 rows=0 loops=1)
        ->  Index Scan using i1392985450172 on jobqueue t0  (cost=0.00..2996093.69 rows=119282 width=144) (actual time=44.118..5665.201 rows=3346768 loops=1)
              Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414685283327::bigint))
              Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
        ->  Index Scan using jobs_pkey on jobs t1  (cost=0.00..0.28 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=3346768)
              Index Cond: (id = t0.jobid)
              Filter: ((status = 'A'::bpchar) OR (status = 'a'::bpchar))
              Rows Removed by Filter: 1
Total runtime: 20620.642 ms
======================
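Incidentally, on the adaptive-sleep idea from my original mail quoted below,
what I had in mind is roughly the following sketch (hypothetical code, not
the actual Stuffer Thread; the cap and growth factor are assumptions that
would need tuning):

```python
# Hypothetical sketch only - not actual ManifoldCF code.
# Idea: instead of a fixed 2-second pause between stuffer iterations,
# back off the pause while consecutive iterations find no work, and
# reset the counter as soon as documents are found again.

BASE_SLEEP_MS = 2000    # the current fixed pause between iterations
MAX_SLEEP_MS = 60000    # assumed cap on the back-off; would need tuning

def next_sleep_ms(consecutive_empty_iterations: int) -> int:
    """Double the pause for each consecutive empty iteration, up to a cap."""
    sleep = BASE_SLEEP_MS * (2 ** min(consecutive_empty_iterations, 5))
    return min(sleep, MAX_SLEEP_MS)
```

With something like this, a quiet system would settle at one expensive query
pair per minute instead of one every two seconds, while a busy system would
keep the current behaviour.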
Best regards,
Aeham
On 30 October 2014 23:45, Karl Wright <[email protected]> wrote:
> The ticket, in case you are interested, is CONNECTORS-1027.
>
> Karl
>
>
> On Thu, Oct 30, 2014 at 7:42 PM, Karl Wright <[email protected]> wrote:
>
> > Hi Aeham,
> >
> > The problem is that you are not using MCF 1.7.1, which has a fix for the
> > stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above.
> >
> > Thanks,
> > Karl
> >
> >
> > On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I have a multi-node zookeeper-based test environment with ~29M documents
> >> (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying
> >> states of being (running vs complete vs stopped vs paused). Some jobs are
> >> continuous and a few are not.
> >>
> >> During a quiet time - when no documents are being actively ingested - I
> >> see very high CPU utilisation on the PostgreSQL database server. This is
> >> primarily the result of two queries executed by the Stuffer Thread,
> >> namely the main query in JobManager#fetchAndProcessDocuments and the doc
> >> priority query in JobManager#getNextDocuments.
> >>
> >> ==========================
> >>
> >> Query 1:
> >> SELECT t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> >> FROM jobqueue t0
> >> WHERE (t0.status=? OR t0.status=?) AND t0.checkaction=? AND t0.checktime<=?
> >>   AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?)
> >>   AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN (?,?,?,?,?,?) AND t2.jobid!=t0.jobid)
> >>   AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
> >> ORDER BY t0.docpriority ASC LIMIT 280
> >>
> >> Query 1 example params:
> >> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]
> >>
> >> ++++++++
> >>
> >> Query 2:
> >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0
> >> WHERE (status=? OR status=?) AND checkaction=? AND checktime<=?
> >>   AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid)
> >> ORDER BY docpriority ASC LIMIT 1
> >>
> >> Query 2 example params:
> >> [P, G, R, 1414685283327, A, a]
> >>
> >> ==========================
> >>
> >> There is one continuous job in a running state which has ~50K documents
> >> in a pending-purgatory ('G') state, but these are not due for processing
> >> for another couple of days. There are a few other jobs which are stopped
> >> and have a few million documents in a pending-purgatory state. I don't
> >> expect Manifold to find any documents eligible for processing in this
> >> case, and it doesn't, but it takes time and (more seriously) a big chunk
> >> of PostgreSQL CPU to find the answer: one full CPU core for ~15 secs per
> >> MCF instance. This wouldn't be a big deal if the queries were executed
> >> infrequently. Unfortunately, they are run in a loop with a 2-second
> >> pause between iterations.
> >> I'd argue that it would be relatively easy to run into this issue with a
> >> bunch of large continuous jobs scheduled to run at varying times of the
> >> day.
> >>
> >> With my limited knowledge of the inner workings of the Stuffer Thread,
> >> it's not obvious to me why query 2 needs to run if query 1 returns no
> >> hits at all. If it's not needed, then short-circuiting it in this case
> >> would help.
> >>
> >> Increasing the sleep time between iterations of the Stuffer Thread, in
> >> response to there being little or no work for a number of consecutive
> >> iterations, could be another easy win.
> >>
> >> A more profound change would be to optimise the joins between the jobs
> >> table and the jobqueue table. One way to do that would be to move some
> >> of the logic into the Java code: first identify the active jobs, then
> >> query the jobqueue table just for those. I understand this would require
> >> merging and sorting priorities in the Java code, which is less than
> >> ideal. I continue to analyse the execution plans of the two queries and
> >> experiment with slight variants to test performance. I'll report back
> >> any important findings.
> >>
> >> Any thoughts or advice are welcome. Also, if you have experience running
> >> large continuous crawls, I'd be interested to hear from you,
> >> particularly about your experience with performance, especially the
> >> performance of the SQL queries.
> >>
> >> Best regards,
> >> Aeham
> >>
> >
> >
>