I'm having some performance problems in a few sales reports running on postgres
8.3, running on Redhat 4.1.2. The hardware is a bit old, but it performs well
enough. The reports are the typical sales reporting fare: Gather the sales of a
time period based some criteria, aggregate them by product, and then join with
a bunch of other tables to display different kinds of product information.
I'll spare you all the pain of looking at the entire queries: The ultimate
issue appears to be the same: The innermost table of the queries is an inline
view, which aggregates the data by product. It runs rather quickly, but
postgres underestimates the number of rows that come out of it, making the rest
of the query plan rather suboptimal. The inline view look like this
select sku_id, sum(rs.price) as dollarsSold, sum(rs.quantity) as units
from reporting.sales rs
where rs.sale_date between ? AND ? group by sku_id
In some cases, we see extra conditions aside of the dates, but they have the
same shape. Barring a massive date range, the rest of the filters are less
selective than the date, so postgres uses an index on sale_date,sku_id. I have
increased the statistics calculations on sale_date quite a bit to make sure
Postgres makes decent row estimates.The problem is in the aggregation:
"HashAggregate (cost=54545.20..54554.83 rows=642 width=24) (actual
time=87.945..98.219 rows=11462 loops=1)"
" -> Index Scan using reporting_sales_sale_date_idx on sales rs
(cost=0.00..54288.63 rows=34209 width=24) (actual time=0.042..34.194 rows=23744
loops=1)"
" Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time
zone))"
"Total runtime: 10.110 ms"
As you an seem the Index scan's estimate is pretty close when I use a single
condition, but the aggregate estimate is off by a factor of 20. When I add
further conditions, the estimate just gets worse and worse.
"HashAggregate (cost=8894.83..8894.85 rows=1 width=24) (actual
time=6.444..6.501 rows=92 loops=1)"
" -> Index Scan using reporting_sales_sale_date_sku_id_idx on sales rs
(cost=0.00..8894.76 rows=9 width=24) (actual time=0.103..6.278 rows=94 loops=1)"
" Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time
zone) AND ((sale_channel)::text = 'RETAIL'::text))"
" Filter: ((activity_type)::text = 'RETURN'::text)"
"Total runtime: 6.583 ms"
I think I've done what I could when it comes to altering statistics: For
example, activity_type and sale_channel have full statistics, and they are
rather independent as filtering mechanisms: If all Postgres did when trying to
estimate their total filtering capacity was just multiply the frequency of each
value, the estimates would not be far off.
The killer seems to be the row aggregation. There are about 95K different
values of sku_id in the sales table, and even the best seller items are a very
small percentage of all rows, so expecting the aggregation to consolidate the
rows 50:1 like it does in one of the explains above is a pipe dream. I've
increased statistics in sku_id into the three digits, but results are not any
better
schemaname;tablename;attname;null_frac;avg_width;n_distinct;most_common_freqs
"reporting";"sales";"sku_id";0;11;58337;"{0.00364167,0.0027125,0.00230417,0.00217083,0.00178333,0.001675,0.00136667,0.00135,0.0012875,0.0011875,...."
Is there any way I can coax Postgres into making a more realistic aggregation
estimate? I could just delay aggregation until the rest of the data is joined,
making the estimate's failure moot, but the price would be quite hefty in some
of the reports, which could return 20K products and widths of over 150, so it's
not optimal, especially when right now the same query that can request 100 rows
could end up requesting 80K.