Agree with Tom on his point about avoiding cost parameter adjustments to fit specific test cases. A few suggestions, as I assume you own this database...

- Check out pg_statio_user_tables and optimize your cache hit ratio on blocks read, over different time durations. pg_stat_bgwriter is worth watching too (read it from a script or something and snapshot it). A sample query follows this list.
- pg_buffercache in contrib/.
- /proc/meminfo on Linux: find out exactly what is going on with your kernel buffer cache (size, how it is buffering) and whether your controller or drive is using a read-ahead cache.
- Might want to play around with partial indexes vs. (or combined with) range partitioning using constraint exclusion, etc.
- Define the I/O characteristics of the dataset, taking into account index clustering and index order on in-memory pages (i.e. re-cluster?). Why the need for multiple indexes if you are clustering the heap on one?
- Solidify the referential integrity constraints between those tables on paper, and define the use cases before modifying the database tables... I assume this is a dev database.
- Linux fs mount options to explore: noatime, writeback, etc.
- Maybe look at prepared statements if you are running a lot of similar queries from a single session? Assuming a web front end for your db, with, say, frequently queried region/category/dates for a large read-only dataset with multiple join conditions.
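For the pg_statio_user_tables item above, something along these lines is what I have in mind for a quick per-table hit-ratio snapshot (the view and its columns are standard; the ratio math is just one way to slice it):

  -- Rough per-table cache hit ratios; snapshot this from a script over
  -- different time windows and compare.
  SELECT relname,
         heap_blks_read,
         heap_blks_hit,
         round(heap_blks_hit::numeric
               / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS heap_hit_ratio,
         idx_blks_read,
         idx_blks_hit,
         round(idx_blks_hit::numeric
               / nullif(idx_blks_hit + idx_blks_read, 0), 3) AS idx_hit_ratio
    FROM pg_statio_user_tables
   ORDER BY heap_blks_read + idx_blks_read DESC;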
There are some good presentations on pgcon.org from PGCon 2010 that was held last week...

http://www.pgcon.org/2010/schedule/events/218.en.html

If you take everything into account and model it correctly (not too loose, not too tight), your solution will be reusable and will save time and hardware expenses.

Regards - Bryan

On Thu, May 27, 2010 at 2:43 AM, David Jarvis <thanga...@gmail.com> wrote:
> Hi, Bryan.
>
> I was just about to reply to the thread, thanks for asking. Clustering was
> key. After rebooting the machine (just to make sure absolutely nothing was
> cached), I immediately ran a report on Toronto: 5.25 seconds!
>
> Here's what I did:
>
>    1. Created a new set of tables that matched the old set, with
>    statistics of 1000 on the station and taken (date) columns.
>    2. Inserted the data from the old hierarchy into the new set, ordered
>    by station id then by date (same seven child tables as before: one per
>    category).
>       - I wanted to ensure a strong correlation between primary key and
>       station id.
>    3. Added three indexes per table: (a) station id; (b) date taken;
>    and (c) station-taken-category.
>    4. Set the station-taken-category index as CLUSTER.
>    5. Vacuumed the new tables.
>    6. Dropped the old tables.
>    7. Set the following configuration values:
>       - shared_buffers = 1GB
>       - temp_buffers = 32MB
>       - work_mem = 32MB
>       - maintenance_work_mem = 64MB
>       - seq_page_cost = 1.0
>       - random_page_cost = 2.0
>       - cpu_index_tuple_cost = 0.001
>       - effective_cache_size = 512MB
>
> I ran a few more reports (no reboots, but reading vastly different data
> sets):
>
>    - Vancouver: 4.2s
>    - Yellowknife: 1.7s
>    - Montreal: 6.5s
>    - Trois-Riviers: 2.8s
>
> No full table scans. I imagine some indexes are not strictly necessary and
> will test to see which can be removed (my guess: the station and taken
> indexes). The problem was that the station ids were scattered and so
> PostgreSQL presumed a full table scan would be faster.
>
> Physically ordering the data by station ids triggers index use every time.
>
> Next week's hardware upgrade should halve those times -- unless anyone has
> further suggestions to squeeze more performance out of PG. ;-)
>
> Dave
>
>
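For anyone following the thread, the steps Dave describes above map roughly onto commands like these. The table, column, and index names below are made up for illustration (his real schema uses seven child tables), and the CLUSTER ... USING syntax assumes 8.4 or later:

  -- Hypothetical names; repeat per child table in the real schema.
  ALTER TABLE measurement ALTER COLUMN station_id SET STATISTICS 1000;
  ALTER TABLE measurement ALTER COLUMN taken      SET STATISTICS 1000;

  CREATE INDEX measurement_station_idx ON measurement (station_id);
  CREATE INDEX measurement_taken_idx   ON measurement (taken);
  CREATE INDEX measurement_stc_idx     ON measurement (station_id, taken, category_id);

  -- Physically reorder the heap on the composite index, then refresh stats.
  CLUSTER measurement USING measurement_stc_idx;
  VACUUM ANALYZE measurement;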