[PERFORM] Hashaggregate estimates

Jorge Montero Wed, 13 Jan 2010 14:00:50 -0800
I'm having some performance problems in a few sales reports running on postgres 
8.3, running on Redhat 4.1.2. The hardware is a bit old, but it performs well 
enough. The reports are the typical sales reporting fare: Gather the sales of a 
time period based some criteria, aggregate them by product, and then join with 
a bunch of other tables to display different kinds of product information.
 
I'll spare you all the pain of looking at the entire queries: The ultimate 
issue appears to be the same: The innermost table of the queries is an inline 
view, which aggregates the data by product. It runs rather quickly, but 
postgres underestimates the number of rows that come out of it, making the rest 
of the query plan rather suboptimal. The inline view look like this
 
select sku_id, sum(rs.price) as dollarsSold, sum(rs.quantity) as units  
  from reporting.sales rs 
  where rs.sale_date between ? AND ?  group by sku_id
 
In some cases, we see extra conditions aside of the dates, but they have the 
same shape. Barring a massive date range, the rest of the filters are less 
selective than the date, so postgres uses an index on sale_date,sku_id. I have 
increased the statistics calculations on sale_date quite a bit to make sure 
Postgres makes decent row estimates.The problem is in the aggregation:
 
"HashAggregate  (cost=54545.20..54554.83 rows=642 width=24) (actual 
time=87.945..98.219 rows=11462 loops=1)"
"  ->  Index Scan using reporting_sales_sale_date_idx on sales rs  
(cost=0.00..54288.63 rows=34209 width=24) (actual time=0.042..34.194 rows=23744 
loops=1)"
"        Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without 
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time 
zone))"
"Total runtime: 10.110 ms"
 
As you an seem the Index scan's estimate is pretty close when I use a single 
condition, but the aggregate estimate is off by a factor of 20. When I add 
further conditions, the estimate just gets worse and worse.
 
"HashAggregate  (cost=8894.83..8894.85 rows=1 width=24) (actual 
time=6.444..6.501 rows=92 loops=1)"
"  ->  Index Scan using reporting_sales_sale_date_sku_id_idx on sales rs  
(cost=0.00..8894.76 rows=9 width=24) (actual time=0.103..6.278 rows=94 loops=1)"
"        Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without 
time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time 
zone) AND ((sale_channel)::text = 'RETAIL'::text))"
"        Filter: ((activity_type)::text = 'RETURN'::text)"
"Total runtime: 6.583 ms"
I think I've done what I could when it comes to altering statistics: For 
example, activity_type and sale_channel have full statistics, and they are 
rather independent as filtering mechanisms: If all Postgres did when trying to 
estimate their total filtering capacity was just multiply the frequency of each 
value, the estimates would not be far off.
 
The killer seems to be the row aggregation. There are about  95K different 
values of sku_id in the sales table, and even the best seller items are a very 
small percentage of all rows, so expecting the aggregation to consolidate the 
rows 50:1 like it does in one of the explains above is a pipe dream. I've 
increased statistics in sku_id into the three digits, but results are not any 
better
 
schemaname;tablename;attname;null_frac;avg_width;n_distinct;most_common_freqs
"reporting";"sales";"sku_id";0;11;58337;"{0.00364167,0.0027125,0.00230417,0.00217083,0.00178333,0.001675,0.00136667,0.00135,0.0012875,0.0011875,...."
 
Is there any way I can coax Postgres into making a more realistic aggregation 
estimate? I could just delay aggregation until the rest of the data is joined, 
making the estimate's failure moot, but the price would be quite hefty in some 
of the reports, which could return 20K products and widths of over 150, so it's 
not optimal, especially when right now the same query that can request 100 rows 
could end up requesting 80K.
[PERFORM] Hashaggregate estimates

Reply via email to