Agree with Tom on his point about avoiding cost parameter adjustments to fit
specific test cases.
A few suggestions...as I assume you own this database...
- check out pg_statio_user_tables - optimize your cache hit ratio on blocks
read; snapshot it and pg_stat_bgwriter from a script over different time
durations so you can compare (rough query below)
- pg_buffercache in contrib/ (sample query below)
- /proc/meminfo on Linux
- find out exactly what is going on with your kernel buffer cache (size, how
it is buffering) and whether your controller or drive is using a read-ahead
cache.
- might want to play around with partial indexes and/or range partitioning
with constraint exclusion, etc. (sketch below)
- define the I/O characteristics of the dataset, taking into account index
clustering and index order on in-memory pages (i.e. re-cluster?) - why the
need for multiple indexes if the heap is clustered on one? (see the CLUSTER
sketch below)
- solidify the referential integrity constraints between those tables, on
paper... define the use cases before modifying the database tables... I
assume this is a dev database
- Linux fs mount options to explore - e.g. noatime, data=writeback, etc.
- maybe look at prepared statements if you are running a lot of similar
queries from a single session? Assuming a web front end for your db - with,
say, frequently queried region/category/dates for a large read-only dataset
with multiple join conditions? (example below)
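
A rough sketch of the kind of snapshot query I mean for the cache hit ratio
(just pg_statio_user_tables here; grab pg_stat_bgwriter the same way and
diff between runs):

  -- per-table cache hit ratio; run periodically and compare snapshots
  SELECT relname,
         heap_blks_read,
         heap_blks_hit,
         round(heap_blks_hit::numeric
               / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
  FROM pg_statio_user_tables
  ORDER BY heap_blks_read + heap_blks_hit DESC;

  -- snapshot the background writer stats alongside it
  SELECT now() AS snapped_at, * FROM pg_stat_bgwriter;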
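
For pg_buffercache, something like the example query from its docs shows
which relations are currently taking up shared_buffers:

  -- top consumers of shared buffers in the current database
  SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
  WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                              WHERE datname = current_database()))
  GROUP BY c.relname
  ORDER BY buffers DESC
  LIMIT 10;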
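
On the partial index / partitioning point, a minimal sketch - the
measurement/station_id/taken names are just stand-ins for whatever your
schema actually uses:

  -- partial index: only index the slice you query hot
  CREATE INDEX measurement_recent_idx
      ON measurement (station_id, taken)
      WHERE taken >= DATE '2000-01-01';

  -- or range partitioning via inheritance + CHECK constraints, so the
  -- planner can skip child tables when constraint_exclusion is on
  CREATE TABLE measurement_2009 (
      CHECK (taken >= DATE '2009-01-01' AND taken < DATE '2010-01-01')
  ) INHERITS (measurement);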
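
And for the re-cluster question, physically reordering the heap by the index
you filter on most is just (again, hypothetical names):

  CLUSTER measurement USING measurement_station_taken_idx;
  ANALYZE measurement;

CLUSTER is a one-shot rewrite, so it has to be re-run (or the table reloaded
in order) as new data arrives.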
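
Prepared statements would look something like this from a single session
(the columns here are invented for the example):

  PREPARE region_report (int, date, date) AS
      SELECT category_id, avg(amount)
      FROM measurement
      WHERE station_id = $1
        AND taken BETWEEN $2 AND $3
      GROUP BY category_id;

  EXECUTE region_report(42, '1950-01-01', '2000-12-31');

That skips the repeated parse/plan overhead, though with only a handful of
big reporting queries per session the win may be small.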

There are some good presentations on pgcon.org from PGCon 2010 that was held
last week...
 http://www.pgcon.org/2010/schedule/events/218.en.html

If you take everything into account and model it correctly (not too loose,
not too tight), your solution will be reusable and will save time and
hardware expenses.

Regards -

Bryan



On Thu, May 27, 2010 at 2:43 AM, David Jarvis <thanga...@gmail.com> wrote:

> Hi, Bryan.
>
> I was just about to reply to the thread, thanks for asking. Clustering was
> key. After rebooting the machine (just to make sure absolutely nothing was
> cached), I immediately ran a report on Toronto: 5.25 seconds!
>
> Here's what I did:
>
>    1. Created a new set of tables that matched the old set, with
>    statistics of 1000 on the station and taken (date) columns.
>    2. Inserted the data from the old hierarchy into the new set, ordered
>    by station id then by date (same seven child tables as before: one per
>    category).
>       - I wanted to ensure a strong correlation between primary key and
>       station id.
>    3. Added three indexes per table: (a) station id; (b) date taken; and
>    (c) station-taken-category.
>    4. Set the station-taken-category index as CLUSTER.
>    5. Vacuumed the new tables.
>    6. Dropped the old tables.
>    7. Set the following configuration values:
>       - shared_buffers = 1GB
>       - temp_buffers = 32MB
>       - work_mem = 32MB
>       - maintenance_work_mem = 64MB
>       - seq_page_cost = 1.0
>       - random_page_cost = 2.0
>       - cpu_index_tuple_cost = 0.001
>       - effective_cache_size = 512MB
>
> I ran a few more reports (no reboots, but reading vastly different data
> sets):
>
>    - Vancouver: 4.2s
>    - Yellowknife: 1.7s
>    - Montreal: 6.5s
>    - Trois-Riviers: 2.8s
>
> No full table scans. I imagine some indexes are not strictly necessary and
> will test to see which can be removed (my guess: the station and taken
> indexes). The problem was that the station ids were scattered and so
> PostgreSQL presumed a full table scan would be faster.
>
> Physically ordering the data by station ids triggers index use every time.
>
> Next week's hardware upgrade should halve those times -- unless anyone has
> further suggestions to squeeze more performance out of PG. ;-)
>
> Dave
>
>
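
In case it helps anyone else reading, steps 1 and 4 above come down to
roughly the following (table/column/index names are hypothetical):

  ALTER TABLE measurement ALTER COLUMN station_id SET STATISTICS 1000;
  ALTER TABLE measurement ALTER COLUMN taken SET STATISTICS 1000;
  ANALYZE measurement;

  CLUSTER measurement USING measurement_station_taken_category_idx;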
