We're hoping to get some suggestions for improving the performance of a 3-table join we're carrying out. (I've stripped out some schema info to try to keep this post from getting too convoluted - if something doesn't make sense it may be because I've erroneously taken out something significant.)
The 3 tables and indices are:

\d branch_purchase_order
                                  Table "public.branch_purchase_order"
      Column       |              Type              |                        Modifiers
-------------------+--------------------------------+------------------------------------------------------------------------
 po_id             | integer                        | not null default nextval('branch_purchase_order_po_id_seq'::regclass)
 branch_code       | character(2)                   |
 po_number         | character varying(20)          |
 supplier          | character varying(50)          |
 order_date        | timestamp(0) without time zone |
 po_state          | character varying(10)          |
Indexes:
    "branch_purchase_order_pkey" PRIMARY KEY, btree (po_id)
    "branch_po_unique_order_no_idx" UNIQUE, btree (branch_code, po_number)
    "branch_po_no_idx" btree (po_number)
    "branch_po_state_idx" btree (po_state)
Referenced by:
    TABLE "branch_purchase_order_products" CONSTRAINT "branch_purchase_order_products_po_id_fkey" FOREIGN KEY (po_id) REFERENCES branch_purchase_order(po_id) ON DELETE CASCADE

\d branch_purchase_order_products
      Table "public.branch_purchase_order_products"
       Column       |              Type              | Modifiers
--------------------+--------------------------------+-----------
 po_id              | integer                        |
 product_code       | character varying(20)          |
 date_received      | date                           |
Indexes:
    "branch_purchase_order_product_code_idx" btree (product_code)
    "branch_purchase_order_product_po_idx" btree (po_id)
    "branch_purchase_order_products_date_received_idx" btree (date_received)
Foreign-key constraints:
    "branch_purchase_order_products_po_id_fkey" FOREIGN KEY (po_id) REFERENCES branch_purchase_order(po_id) ON DELETE CASCADE

\d stocksales_ib
            Table "public.stocksales_ib"
    Column    |              Type              | Modifiers
--------------+--------------------------------+-----------
 row_id       | integer                        |
 branch_code  | character(2)                   |
 product_code | character varying(20)          |
 invoice_date | timestamp(0) without time zone |
 qty          | integer                        |
 order_no     | character varying(30)          |
Indexes:
    "ssales_ib_branch_idx" btree (branch_code)
    "ssales_ib_invoice_date_date_idx" btree ((invoice_date::date))
    "ssales_ib_invoice_date_idx" btree (invoice_date)
    "ssales_ib_order_no" btree (order_no)
    "ssales_ib_product_idx" btree (product_code)
    "ssales_ib_replace_order_no" btree (replace(order_no::text, ' '::text, ''::text))
    "ssales_ib_row_idx" btree (row_id)
    "stocksales_ib_branch_code_row_id_idx" btree (branch_code, row_id)
    "stocksales_ib_substring_idx" btree ("substring"(replace(order_no::text, ' '::text, ''::text), 3, 2))

The join we're using is:

    branch_purchase_order o
    join branch_purchase_order_products p using (po_id)
    join stocksales_ib ss
      on o.supplier = ss.branch_code
     and p.product_code = ss.product_code
     and X

We have 3 different ways we have to do the final X join condition (we use 3 subqueries UNIONed together), but the one causing the issues is:

    (o.branch_code || o.po_number = replace(ss.order_no,' ',''))

which joins branch_purchase_order to stocksales_ib under the following circumstances:

   ss.order_no  | o.branch_code | o.po_number
----------------+---------------+-------------
 AA IN105394    | AA            | IN105394
 BB IN105311    | BB            | IN105311
 CC IN105311    | CC            | IN105311
 DD IN105310    | DD            | IN105310
 EE IN105310    | EE            | IN105310

The entire query (leaving aside the UNIONed subqueries for readability) looks like this:

    select po_id,
           product_code,
           sum(qty) as dispatch_qty,
           max(invoice_date) as dispatch_date,
           count(invoice_date) as dispatch_count
    from (
        select o.po_id, p.product_code, ss.qty, ss.invoice_date
        from branch_purchase_order o
        join branch_purchase_order_products p using (po_id)
        join stocksales_ib ss
          on o.supplier = ss.branch_code
         and p.product_code = ss.product_code
         and (o.branch_code || o.po_number = replace(ss.order_no,' ',''))
        where o.po_state = 'PLACED'
          and o.supplier = 'XX'
    ) x
    group by po_id, product_code

Explain output: http://explain.depesz.com/s/TzF8h

                                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=83263.72..83263.73 rows=1 width=24) (actual time=23908.777..23927.461 rows=52500 loops=1)
   Buffers: shared hit=23217993 dirtied=1
   ->  Nested Loop  (cost=1.29..83263.71 rows=1 width=24) (actual time=0.196..23799.930 rows=53595 loops=1)
         Join Filter: (o.po_id = p.po_id)
         Rows Removed by Join Filter: 23006061
         Buffers: shared hit=23217993 dirtied=1
         ->  Nested Loop  (cost=0.86..57234.41 rows=3034 width=23) (actual time=0.162..129.508 rows=54259 loops=1)
               Buffers: shared hit=18520
               ->  Index Scan using branch_po_state_idx on branch_purchase_order o  (cost=0.42..807.12 rows=1672 width=17) (actual time=0.037..4.863 rows=1916 loops=1)
                     Index Cond: ((po_state)::text = 'PLACED'::text)
                     Filter: ((supplier)::text = 'XX'::text)
                     Rows Removed by Filter: 3050
                     Buffers: shared hit=2157
               ->  Index Scan using ssales_ib_replace_order_no on stocksales_ib ss  (cost=0.44..33.74 rows=1 width=31) (actual time=0.014..0.044 rows=28 loops=1916)
                     Index Cond: (replace((order_no)::text, ' '::text, ''::text) = ((o.branch_code)::text || (o.po_number)::text))
                     Filter: ((o.supplier)::bpchar = branch_code)
                     Rows Removed by Filter: 0
                     Buffers: shared hit=16363
         ->  Index Scan using branch_purchase_order_product_code_idx on branch_purchase_order_products p  (cost=0.43..5.45 rows=250 width=12) (actual time=0.018..0.335 rows=425 loops=54259)
               Index Cond: ((product_code)::text = (ss.product_code)::text)
               Buffers: shared hit=23199473 dirtied=1
 Total runtime: 23935.995 ms
(22 rows)

So we can see straight away that the outer Nested Loop expects 1 row but gets 53595, which isn't going to help the planner pick the most efficient plan. I've tried increasing default_statistics_target to the maximum and re-analysing all the tables involved, but this does not improve the estimate. I suspect that's because the join condition is based on the result of a function, so the column statistics are effectively ignored?
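As an aside, one idea we have wondered about but not actually tried - so treat this as a sketch rather than something we know helps - is an expression index on the concatenation on the branch_purchase_order side, on the assumption that ANALYZE would then gather statistics for that expression (much as it presumably already does for the replace() expression behind ssales_ib_replace_order_no). The index name here is just a placeholder:

    -- hypothetical: index the concatenated value used in the join condition
    -- so that ANALYZE collects statistics for the expression
    create index branch_po_concat_idx
        on branch_purchase_order ((branch_code || po_number));
    analyze branch_purchase_order;

Whether that would actually fix the row estimate for the nested loop, we don't know.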
What has improved runtimes is using a WITH clause to carry out the first join explicitly. Although it runs in half the time, the estimates are still way out, and I suspect it only helps because limiting the planner's choices happens to push it towards a different, quicker plan (it does a Hash Join and a Seq Scan):

    with bpo as (
        select o.branch_code || o.po_number as order_no,
               o.po_id,
               o.supplier,
               o.branch_code,
               p.product_code
        from branch_purchase_order o
        join branch_purchase_order_products p using (po_id)
        where o.po_state = 'PLACED'
          and o.supplier = 'XX'
    )
    select po_id,
           product_code,
           sum(qty) as dispatch_qty,
           max(invoice_date) as dispatch_date,
           count(invoice_date) as dispatch_count
    from (
        select o.po_id, o.product_code, ss.qty, ss.invoice_date
        from bpo o
        join stocksales_ib ss
          on o.supplier = ss.branch_code
         and o.product_code = ss.product_code
         and o.order_no = replace(ss.order_no,' ','')
    ) x
    group by po_id, product_code

Explain: http://explain.depesz.com/s/r7v

Can anyone suggest a better approach for improving the plan for this type of query?

select version();
                                                    version
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2), 64-bit

Regards,
--
David