On Wed, 27 Apr 2016 07:28 Tim van der Linden, <t...@shisaa.jp> wrote:
> Hi all
>
> I have asked this question in a somewhat different form on the DBA
> Stackexchange site, but without much luck
> (https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables),
> so I apologize for possibly double posting, but I hope this might get
> a better response on the mailing list.
>
> I'm joining three fairly large tables together, and it is slow. The
> tables are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
> The table "reports" holds the main report data and has a primary key
> column "id". The other two tables reference it through a foreign key
> column "rid". Those are the columns I use to join them together.
>
> All tables have indexes on the "id"/"rid" columns and on the
> "drug"/"adverse" columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ORDER BY r.created;
>
> The plan:
>
> Sort (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
>   Sort Key: r.created
>   Sort Method: quicksort  Memory: 60kB
>   -> Nested Loop (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
>      Join Filter: (d.rid = a.rid)
>      -> Nested Loop (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)

Quite clearly the nested loop joins are the most costly operations here.

>         -> Index Scan using report_drugs_drug_idx on report_drugs d (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
>            Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>         -> Index Scan using reports_id_key on reports r (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>            Index Cond: (id = d.rid)
>      -> Index Scan using report_adverses_rid_idx on report_adverses a (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
>         Index Cond: (rid = r.id)
>         Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
>         Rows Removed by Filter: 5

I suppose so. It might help if the filters were applied before the join.
I am not an expert on the optimizer, but I guess it might help if you
change the join order so the two large tables are joined first:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
JOIN reports r ON d.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;

Or, since you are using INNER JOINs (as far as I understand the concept
of joins), it won't hurt the result set if the WHERE clause is pushed
into the join criteria:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
  AND a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
  AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
JOIN reports r ON d.rid = r.id
ORDER BY r.created;
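If neither rewrite changes the plan much, it might also be worth looking
at a composite index that covers both the join key and the filter column
on report_adverses, so the adverse filter can be checked inside the
index scan rather than in a separate Filter step (your plan is throwing
rows away with "Rows Removed by Filter: 5" on each of the 14199 loops).
Just a sketch, assuming no such index exists yet (the index name is made
up):

CREATE INDEX report_adverses_rid_adverse_idx
    ON report_adverses (rid, adverse);

Then run EXPLAIN ANALYZE again to see whether the planner picks it up.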
> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>
> This takes well over 5 seconds, which to me, feels much too slow.

Well, given your hardware (which seems to be pretty decent), I am not
sure this is an unfair expectation.

> If I query each table directly with the same conditions, thus:
>
> SELECT reason
> FROM report_drugs
> WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);
>
> I get:
>
> Index Scan using report_drugs_drug_idx on report_drugs (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
>   Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> Planning time: 6.939 ms
> Execution time: 4.759 ms
>
> Under 5 ms. The same goes for querying the "adverse" column in the
> "report_adverses" table: under 20 ms.
>
> This indicates to me that the join itself is indeed the major
> performance bottleneck.
>
> I'm running the cluster from an SSD drive, as a traditional HDD could
> not even manage the query in under 5 minutes. The system has a total
> of 24 GB of memory, runs on Debian and uses a 4 GHz 8 core i7-4790
> processor.
>
> Some important postgresql.conf readouts:
>
> - shared_buffers = 4GB
> - work_mem = 64MB
> - maintenance_work_mem = 1GB
> - checkpoint_segments = 50
> - checkpoint_completion_target = 0.9
> - autovacuum = on
>
> Is there something I am missing here? Any help on getting this join
> faster is much appreciated.
>
> Cheers,
> Tim

--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350 | www.ashnik.com
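PS: One more thought, on the postgresql.conf readouts: with 24 GB of RAM
and an SSD, the planner defaults are fairly conservative (they assume
spinning disks). Just a sketch, the values below are guesses for your
machine, so please test before applying them anywhere important:

effective_cache_size = 16GB   # roughly 2/3 of RAM; a planner hint, allocates nothing
random_page_cost = 1.5        # closer to seq_page_cost on SSD, makes index scans look cheaper

A lower random_page_cost in particular can shift the planner's choice
between nested loops and other join strategies in plans like yours.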