On Wed, 27 Apr 2016 14:09:04 +1200
David Rowley <david.row...@2ndquadrant.com> wrote:

Hi David

Thanks for your time on this. I tried your proposals with the results below.

> > ...
> > Under 5 ms. The same goes for querying the "adverse" column in the 
> > "report_adverses" table: under 20 ms.
> 
> I'm not sure why you're comparing this to the join plan above. They're
> very different; I can only imagine that it's because you've not quite
> understood what the EXPLAIN output is saying.

Well, that is more down to my poor wording of the question, I guess. I understand 
that the two plans are entirely different and that the two queries are not really 
comparable, since two very different things are happening.

My point here was to show that the "drug" column seems to be indexed correctly 
and, even though it is a very large table, the query executes very fast (all 
these execution times are given after flushing the OS's disk cache).
 
> >      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual 
> > time=0.822..2038.952 rows=14199 loops=1)
> >          ->  Index Scan using report_drugs_drug_idx on report_drugs d  
> > (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 
> > rows=14199 loops=1)
> >                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> >          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 
> > rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
> >                Index Cond: (id = d.rid)
> 
> This is a parameterised nested loop. This means that the inner side of
> the loop (reports) is parameterised by the outer side of the loop; you
> can see the d.rid in the Index Cond. This means that the
> reports_id_key index is looked up 14199 times. You can see from the
> plan that the nested loop here took 2038.952 milliseconds to complete,
> so about 0.144 ms per lookup, not too bad, right?

That is indeed not bad at all, thanks for pointing this out. This nested loop 
is thus responsible for a little over 40% of the total execution time.
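Just to double-check that per-lookup figure (a quick sketch; any SQL client 
works as a calculator here):

```sql
-- Nested Loop actual time divided by the number of inner-side lookups
-- (2038.952 ms total, 14199 loops of the reports_id_key index scan)
SELECT round(2038.952 / 14199, 4) AS ms_per_lookup;  -- ≈ 0.1436
```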

> The alternative to a parameterised nested loop plan is that a Hash
> Join plan is generated. If you imagine what would happen here, likely
> the matching report_drugs records would be hashed, then the outer side
> of the join would then perform a SeqScan over the entire 6 million
> reports records... probably not better. You can test this yourself
> with: SET enable_nestloop = 0; you might need to SET enable_mergejoin
> = 0; too. I imagine it'll be slower.

Doing this with the current (unaltered) indexes almost doubles the execution 
time, and the plan behaves exactly as you described:

 Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual 
time=9308.174..9308.187 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Hash Join  (cost=130941.07..372955.11 rows=317 width=41) (actual 
time=8131.509..9307.983 rows=448 loops=1)
         Hash Cond: (r.id = d.rid)
         ->  Seq Scan on reports r  (cost=0.00..105465.00 rows=5749300 
width=20) (actual time=0.237..849.564 rows=5749300 loops=1)
         ->  Hash  (cost=130937.10..130937.10 rows=317 width=29) (actual 
time=8130.175..8130.175 rows=448 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 25kB
               ->  Hash Join  (cost=122150.13..130937.10 rows=317 width=29) 
(actual time=8118.130..8130.090 rows=448 loops=1)
                     Hash Cond: (d.rid = a.rid)
                     ->  Index Scan using report_drugs_drug_idx on report_drugs 
d  (cost=0.44..496.92 rows=13870 width=8) (actual time=0.830..3.503 rows=14200 
loops=1)
                           Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
                     ->  Hash  (cost=121170.69..121170.69 rows=78320 width=21) 
(actual time=8116.544..8116.544 rows=76972 loops=1)
                           Buckets: 8192  Batches: 1  Memory Usage: 3609kB
                           ->  Bitmap Heap Scan on report_adverses a  
(cost=2024.68..121170.69 rows=78320 width=21) (actual time=19.303..8093.257 
rows=76972 loops=1)
                                 Recheck Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
                                 Heap Blocks: exact=56959
                                 ->  Bitmap Index Scan on 
report_adverses_adverse_idx  (cost=0.00..2005.10 rows=78320 width=0) (actual 
time=11.756..11.756 rows=76972 loops=1)
                                       Index Cond: (adverse = ANY ('{"back pain 
- nonspecific","nonspecific back pain","back pain"}'::text[]))
 Planning time: 11.348 ms
 Execution time: 9308.694 ms
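
For reference, these are the planner settings I disabled for that run, exactly 
as you suggested:

```sql
-- Steer the planner away from nested loop and merge join for this session,
-- so it falls back to the hash join plan above
SET enable_nestloop = 0;
SET enable_mergejoin = 0;
```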

> Likely the only way to speed this up would be to create indexes;
> 
> create index on reports (id, age, gender, created);
> ...
> create index on report_adverses (rid, adverse);
> 
> this might speed up the 2nd join a bit.

Done. This does help a bit; the query runs about 1 second faster. The plan:

 Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual 
time=4313.679..4313.708 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.44..66051.23 rows=344 width=41) (actual 
time=43.987..4313.435 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         ->  Nested Loop  (cost=1.00..54700.19 rows=14319 width=28) (actual 
time=0.772..1158.338 rows=14200 loops=1)
               ->  Index Scan using report_drugs_drug_idx on report_drugs d  
(cost=0.44..507.78 rows=14319 width=8) (actual time=0.579..4.327 rows=14200 
loops=1)
                     Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
               ->  Index Only Scan using reports_id_age_gender_created_idx on 
reports r  (cost=0.56..3.77 rows=1 width=20) (actual time=0.081..0.081 rows=1 
loops=14200)
                     Index Cond: (id = d.rid)
                     Heap Fetches: 0
         ->  Index Scan using report_adverses_rid_idx on report_adverses a  
(cost=0.44..0.78 rows=1 width=21) (actual time=0.222..0.222 rows=0 loops=14200)
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific 
back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
 Planning time: 15.968 ms
 Execution time: 4313.755 ms

Both the (rid, adverse) and the (id, age, gender, created) indexes are now used.
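
For reference, these are the two index definitions from your suggestion that 
the new plan picks up:

```sql
-- Covers the Index Only Scan on reports (no heap fetches needed)
CREATE INDEX ON reports (id, age, gender, created);

-- Covers the rid lookup on report_adverses
CREATE INDEX ON report_adverses (rid, adverse);
```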

Yet this is "only" about 1 second faster; the query (the joins, really) still 
takes over 4 seconds to complete. Is this truly the most I can get out of it, 
or could further tweaking of PostgreSQL's configuration be of any help here?
 
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
