> How many rows in the table altogether? A rough guess is a few million
> based on the estimated seqscan cost. That would mean that this query
> is retrieving about 10% of the table, which is a large enough fraction
> that the planner will probably think a seqscan is best. It may be right.
> If you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE
> output change?
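For reference, that's a per-session toggle, so the test amounts to the following (the report query itself is elided here, since it's long):

    set enable_seqscan = off;    -- make seqscans look prohibitively expensive to the planner
    explain analyze select ...;  -- the monthly report query goes here
    set enable_seqscan = on;     -- restore the default afterwards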
Right now it's a hair under 800k rows. With enable_seqscan = off, here are my EXPLAIN ANALYZE results:
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=361676.23..361676.33 rows=38 width=226) (actual time=46076.756..46076.924 rows=69 loops=1)
   Sort Key: count
   ->  Subquery Scan aggs  (cost=361674.10..361675.24 rows=38 width=226) (actual time=46068.621..46076.159 rows=69 loops=1)
         ->  HashAggregate  (cost=361674.10..361674.86 rows=38 width=54) (actual time=46068.596..46075.170 rows=69 loops=1)
               ->  Index Scan using received_date_idx on messages  (cost=0.00..349968.44 rows=585283 width=54) (actual time=20.988..15020.821 rows=589543 loops=1)
                     Index Cond: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
 Total runtime: 46091.315 ms
The estimated costs are quite a bit higher this way. Following up on your 10%-of-the-table guess above, I did a breakdown by date and got these results:
geekisp=> select received_date, count(received_date) from spamreport.messages group by received_date order by received_date asc;
received_date | count
---------------+-------
2004-10-20 | 7592
2004-10-21 | 19699
2004-10-22 | 17311
2004-10-23 | 16730
2004-10-24 | 18249
2004-10-25 | 16718
2004-10-26 | 16951
2004-10-27 | 19818
2004-10-28 | 19580
2004-10-29 | 17610
2004-10-30 | 16210
2004-10-31 | 20468
2004-11-01 | 12337
2004-11-02 | 9012
2004-11-03 | 20871
2004-11-04 | 20103
2004-11-05 | 18807
2004-11-06 | 20131
2004-11-07 | 22291
2004-11-08 | 23041
2004-11-09 | 20254
2004-11-10 | 17810
2004-11-11 | 21091
2004-11-12 | 21976
2004-11-13 | 18824
2004-11-14 | 20543
2004-11-15 | 18829
2004-11-16 | 24248
2004-11-17 | 18093
2004-11-18 | 25675
2004-11-19 | 27084
2004-11-20 | 22362
2004-11-21 | 25187
2004-11-22 | 26451
2004-11-23 | 26016
2004-11-24 | 23147
2004-11-25 | 25785
2004-11-26 | 20584
2004-11-27 | 25615
2004-11-28 | 6931
2004-11-29 | 6549
(41 rows)
So it looks like an aggregation of two weeks' worth of data is already more than 10% of the table, which is aligned with what you were saying. That also jibes with the behavior I've seen: when I re-enable enable_seqscan and run the report for just one or two days of data, it uses the index (previously it did not - perhaps there was not enough data then).
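To put a number on that fraction, a quick sanity check along these lines works (same table and date range as the report query above; this isn't part of the report itself):

    select (select count(*)
              from spamreport.messages
             where received_date between '2004-11-01' and '2004-11-30')::numeric
           / count(*) as fraction
      from spamreport.messages;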
> If it's not right, you may want to try to adjust random_page_cost and/or
> effective_cache_size so that the planner's estimated costs are more in
> line with reality. Beware of making such adjustments on the basis of
> only one test case, though.
I'll look in the manual and try playing with those options; thanks for your suggestions! In the end, I'm guessing that if I need to retrieve a few hundred thousand rows from disk, faster disks might make the biggest impact on performance.
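As a starting point for that experimenting, I'm thinking of per-session tests along these lines (the values here are placeholders to illustrate the mechanics, not recommendations; on my version effective_cache_size is given in 8kB pages):

    set random_page_cost = 2;          -- default is 4; lower values favor index scans
    set effective_cache_size = 65536;  -- 65536 * 8kB = 512MB; size this to the OS cache
    explain analyze select ...;        -- re-run the report query and compare plans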
Thanks for your help!

--
Dave Steinberg
http://www.geekisp.com/