Re: [PERFORM] Reading data in bulk - help?

Chris Huston Wed, 10 Sep 2003 00:39:47 -0700

Thanks Josh that helped. I had gone looking for some kind of cluster option but was looking under create database, create index and initlocation - didn't see the CLUSTER index ON table.

I ran the CLUSTER which took about 2 1/2 hours to complete. That improved the query performance about 6x - which is great - but is still taking 26 minutes to do what a serial read does in about 2 1/2 minutes.

At this point I'm ok because each fetch is taking around 200 milliseconds from call to the time the data is ready. The processing takes 300-600ms per batch. I've got the fetch and the processing running in separate threads so even if postgres was running faster it wouldn't help this implementation.

However, "iostat" is still reporting average size per transfer of about 10kB and total thru-put of about 1MB/s. The transfers per second went from >200/s to about 80/s. It still seams like it ought to be a faster.

The system is currently running on a single processor 500Mhz G4. We're likely to move to a two processor 2Ghz G5 in the next few months. Then each block may take only a 30-60 milliseconds to complete and their can be two concurrent blocks processing at once.

Sometime before then I need to figure out how to cut the fetch times from the now 200ms to something like 10ms. There are currently 1,628,800 records in the single data table representing 6817 groups. Each group has 2 to 284 records - with 79% having the max 284 (max grows by 1 every day - although the value may change throughout the day). Each record is maybe 1 or 2k so ideally each batch/group should require 284-568k - at 10MB/s - that'd be

RELATED QUESTION: How now do I speed up the following query: "select distinct group_id from datatable"? Which results in a sequential scan of the db. Why doesn't it use the group_id index? I only do this once per run so it's not as critical as the fetch speed which is done 6817 times.

Thanks for the help!
- Chris

On Tuesday, Sep 9, 2003, at 18:11 America/Denver, Josh Berkus wrote:

Chris,

I've got an application that needs to chunk through ~2GB of data. The data is ~7000 different sets of 300 records each. I put all of the data into a postgres database but that doesn't look like its going to work because of how the data lives on the disk.
Your problem is curable through 4 steps:

1) adjust your postgresql.conf to appropriate levels for memory usage.

2) if those sets of 300 are blocks in some contiguous order, then cluster them to force their physical ordering on disk to be the same order you want to read them in. This will require you to re-cluster whenever you change a significant number of records, but from the sound of it that happens in batches.

3) Get better disks, preferrably a RAID array, or just very fast scsi if the database is small. If you're budget-constrained, Linux software raid (or BSD raid) on IDE disks is cheap. What kind of RAID depends on what else you'll be doing with the app; RAID 5 is better for read-only access, RAID 1+0 is better for read-write.

4) Make sure that you aren't dumping the data to the same disk postgreSQL lives on! Preferably, make sure that your swap partition is on a different disk/array from postgresql. If the computing app is complex and requires disk reads aside from postgres data, you should make sure that it lives on yet another disk. Or you can simplify this with a good, really large multi-channel RAID array.
--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] Reading data in bulk - help?

Reply via email to