I ran the CLUSTER, which took about 2 1/2 hours to complete. That improved query performance about 6x - which is great - but the run still takes 26 minutes to do what a serial read does in about 2 1/2 minutes.
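For anyone following along, this is roughly the command I mean - a minimal sketch, where the index name datatable_group_id_idx is an assumption (use whatever index the fetches actually go through):

```sql
-- Rewrites the table so rows are physically ordered to match the index
-- (pre-8.0 syntax; the index name here is hypothetical):
CLUSTER datatable_group_id_idx ON datatable;
ANALYZE datatable;  -- refresh planner statistics after the rewrite
```

Note that CLUSTER is a one-shot rewrite, which is why it has to be repeated as the data drifts out of order.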
At this point I'm OK with that, because each fetch takes around 200 milliseconds from call to the time the data is ready. The processing takes 300-600ms per batch. I've got the fetch and the processing running in separate threads, so even if Postgres were running faster it wouldn't help this implementation.
However, "iostat" is still reporting an average size per transfer of about 10kB and total throughput of about 1MB/s. The transfers per second went from >200/s to about 80/s. It still seems like it ought to be faster.
The system is currently running on a single-processor 500MHz G4. We're likely to move to a two-processor 2GHz G5 in the next few months. Then each block may take only 30-60 milliseconds to complete, and there can be two blocks processing concurrently.
Sometime before then I need to figure out how to cut the fetch times from the current 200ms to something like 10ms. There are currently 1,628,800 records in the single data table, representing 6817 groups. Each group has 2 to 284 records - with 79% having the max 284 (the max grows by 1 every day, although the value may change throughout the day). Each record is maybe 1 or 2k, so ideally each batch/group should require 284-568kB - at 10MB/s that'd be roughly 30-60ms per batch.
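Spelling out that arithmetic as a quick sanity check, using the worst-case numbers above:

```python
# Back-of-envelope check of the batch fetch target quoted above.
records_per_group = 284               # max group size from the post
bytes_per_record = 2 * 1024           # upper end of the 1-2k estimate
throughput = 10 * 1024 * 1024         # assumed 10MB/s sequential read

batch_bytes = records_per_group * bytes_per_record
ms = batch_bytes / throughput * 1000
print(f"{batch_bytes / 1024:.0f} kB -> {ms:.0f} ms per batch")
# -> 568 kB -> 55 ms per batch
```

So even the largest groups should be well under the current 200ms if the reads were purely sequential.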
RELATED QUESTION: Now, how do I speed up the following query: "select distinct group_id from datatable"? It results in a sequential scan of the table. Why doesn't it use the group_id index? I only do this once per run, so it's not as critical as the fetch speed, which is exercised 6817 times.
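In case it helps, two workarounds I've seen suggested for planners that always sort for DISTINCT - table and column names are from the query above, and the side table is purely illustrative:

```sql
-- GROUP BY can sometimes be planned differently than DISTINCT:
SELECT group_id FROM datatable GROUP BY group_id;

-- Or keep a small side table of known groups and read that instead of
-- scanning the 1.6M-row table (hypothetical; it would have to be
-- maintained by the application or a trigger):
CREATE TABLE group_ids (group_id integer PRIMARY KEY);
SELECT group_id FROM group_ids;
```

With only 6817 distinct values in 1.6M rows, the side-table approach turns a full scan into a trivial read.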
Thanks for the help! - Chris
On Tuesday, Sep 9, 2003, at 18:11 America/Denver, Josh Berkus wrote:
I've got an application that needs to chunk through ~2GB of data. The
data is ~7000 different sets of 300 records each. I put all of the data
into a postgres database but that doesn't look like its going to work
because of how the data lives on the disk.
Your problem is curable through 4 steps:
1) adjust your postgresql.conf to appropriate levels for memory usage.
2) if those sets of 300 are blocks in some contiguous order, then cluster them
to force their physical ordering on disk to be the same order you want to
read them in. This will require you to re-cluster whenever you change a
significant number of records, but from the sound of it that happens in
3) Get better disks, preferably a RAID array, or just very fast scsi if the
database is small. If you're budget-constrained, Linux software raid (or
BSD raid) on IDE disks is cheap. What kind of RAID depends on what else
you'll be doing with the app; RAID 5 is better for read-only access, RAID 1+0
is better for read-write.
4) Make sure that you aren't dumping the data to the same disk PostgreSQL
lives on! Preferably, make sure that your swap partition is on a different
disk/array from postgresql. If the computing app is complex and requires
disk reads aside from postgres data, you should make sure that it lives on
yet another disk. Or you can simplify this with a good, really large
multi-channel RAID array.
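[For step 1, this is the sort of postgresql.conf change usually meant - the values below are placeholders for a machine with, say, 1GB of RAM, not recommendations; on this version shared_buffers and effective_cache_size are counted in 8kB pages and sort_mem in kB:

```
# Illustrative values only - tune to the actual RAM on the box
shared_buffers = 8192          # ~64MB of shared buffer cache
sort_mem = 16384               # ~16MB per sort before spilling to disk
effective_cache_size = 65536   # ~512MB; hint about the OS file cache
```

- Chris]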
--
-Josh Berkus
Aglio Database Solutions
San Francisco
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly