> How many rows in the table altogether? A rough guess is a few million
> based on the estimated seqscan cost. That would mean that this query
> is retrieving about 10% of the table, which is a large enough fraction
> that the planner will probably think a seqscan is best. It may be right.
> If you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE
> output change?
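For reference, that's a per-session toggle, so the test amounts to the following (the report query itself is elided here, since it's long):

    set enable_seqscan = off;    -- make seqscans look prohibitively expensive to the planner
    explain analyze select ...;  -- the monthly report query goes here
    set enable_seqscan = on;     -- restore the default afterwards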
Right now it's a hair under 800k rows. With enable_seqscan = off, here are my EXPLAIN ANALYZE results:
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=361676.23..361676.33 rows=38 width=226) (actual time=46076.756..46076.924 rows=69 loops=1)
   Sort Key: count
   ->  Subquery Scan aggs  (cost=361674.10..361675.24 rows=38 width=226) (actual time=46068.621..46076.159 rows=69 loops=1)
         ->  HashAggregate  (cost=361674.10..361674.86 rows=38 width=54) (actual time=46068.596..46075.170 rows=69 loops=1)
               ->  Index Scan using received_date_idx on messages  (cost=0.00..349968.44 rows=585283 width=54) (actual time=20.988..15020.821 rows=589543 loops=1)
                     Index Cond: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
 Total runtime: 46091.315 ms
The estimated costs are quite a bit higher this way. Following up on your 10%-of-the-table guess above, I did a breakdown by date and got these results:
geekisp=> select received_date, count(received_date) from spamreport.messages group by received_date order by received_date asc;
received_date | count
---------------+-------
2004-10-20 | 7592
2004-10-21 | 19699
2004-10-22 | 17311
2004-10-23 | 16730
2004-10-24 | 18249
2004-10-25 | 16718
2004-10-26 | 16951
2004-10-27 | 19818
2004-10-28 | 19580
2004-10-29 | 17610
2004-10-30 | 16210
2004-10-31 | 20468
2004-11-01 | 12337
2004-11-02 | 9012
2004-11-03 | 20871
2004-11-04 | 20103
2004-11-05 | 18807
2004-11-06 | 20131
2004-11-07 | 22291
2004-11-08 | 23041
2004-11-09 | 20254
2004-11-10 | 17810
2004-11-11 | 21091
2004-11-12 | 21976
2004-11-13 | 18824
2004-11-14 | 20543
2004-11-15 | 18829
2004-11-16 | 24248
2004-11-17 | 18093
2004-11-18 | 25675
2004-11-19 | 27084
2004-11-20 | 22362
2004-11-21 | 25187
2004-11-22 | 26451
2004-11-23 | 26016
2004-11-24 | 23147
2004-11-25 | 25785
2004-11-26 | 20584
2004-11-27 | 25615
2004-11-28 | 6931
2004-11-29 | 6549
(41 rows)
So it looks like an aggregation of two weeks' worth of data is already more than 10% of the table, which is aligned with what you were saying. That also jibes with the behavior I've seen: when I re-enable enable_seqscan and run the report for just one or two days of data, it uses the index (previously it did not - perhaps there was not enough data then).
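To put a number on that fraction, a quick sanity check along these lines works (same table and date range as the report query above; this isn't part of the report itself):

    select (select count(*)
              from spamreport.messages
             where received_date between '2004-11-01' and '2004-11-30')::numeric
           / count(*) as fraction
      from spamreport.messages;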
> If it's not right, you may want to try to adjust random_page_cost and/or
> effective_cache_size so that the planner's estimated costs are more in
> line with reality. Beware of making such adjustments on the basis of
> only one test case, though.
I'll look in the manual and try playing with those options; thanks for your suggestions! In the end, I'm guessing that if I need to retrieve a few hundred thousand rows from disk, faster disks might make the biggest impact on performance.
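As a starting point for that experimenting, I'm thinking of per-session tests along these lines (the values here are placeholders to illustrate the mechanics, not recommendations; on my version effective_cache_size is given in 8kB pages):

    set random_page_cost = 2;          -- default is 4; lower values favor index scans
    set effective_cache_size = 65536;  -- 65536 * 8kB = 512MB; size this to the OS cache
    explain analyze select ...;        -- re-run the report query and compare plans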
Thanks for your help!

--
Dave Steinberg
http://www.geekisp.com/