Thanks for all the suggestions. It seems that creating indices, or even import data using a copy is easy to implement. I also have some jobs that create reports and want to try if I gain anything if i work reports in parallel.

will give it a try in the next week and let you know the resuls.

Alex

John A Meinel wrote:

Alex wrote:

Thanks John.

Well as I mentioned. I have a Dual AMD Opteron 64 2.4ghz, 15k rpm SCSI Disks, 4GB of memory.
Disks are pretty fast and memory should be more than enough. Currently we dont have many concurrent connections.


Well, you didn't mention Opteron before (it makes a difference against Xeons).
How many disks and in what configuration?
Do you have pg_xlog on a separate set of disks?
Are your drives in RAID 10 (0+1) or RAID 5?


If you have enough disks the recommended configuration is at least a RAID1 for the OS, RAID 10 for pg_xlog (4drives), and RAID 10 (the rest of the drives) for the actual data.

If your dataset is read heavy, or you have more than 6 disks, you can get away with RAID 5 for the actual data. But since you are talking about loading 8million rows at once, it certainly sounds like you are write heavy.

If you only have a few disks, it's still probably better to put pg_xlog on it's own RAID1 (2-drive) mirror. pg_xlog is pretty much append only, so if you dedicate a disk set to it, you eliminate a lot of seek times.

I run PG 8.0.1 on Fedora Core 3

When I now run the batch job, one CPU runs in the 80-90% the other in 5-10% max.



Anyway, it doesn't completely sound like you are CPU limited, but you might be able to get a little bit more if you spawn another process. Have you tried dropping the index, doing the copy, and then recreating the 4-indexes in separate processes?


The simple test for this is to open 3-4 psql connections, have one of them drop the indexes and do the copy, in the other connections you can already have typed "CREATE INDEX ..." so when the copy is done and committed to the database, you just go to the other terminals and hit enter.

Unfortunately you'll have to use wall clock time to see if this is faster.

Though I think you could do the same thing with a bash script. The authentication should be in "trust" mode so that you don't take the time to type your password.

#!/bin/bash
psql -h <host> -c "DROP INDEX ...; COPY FROM ..."

psql -h <host> -c "CREATE INDEX ..." &
psql -h <host> -c "CREATE INDEX ..." &
psql -h <host> -c "CREATE INDEX ..." &
psql -h <host> -c "CREATE INDEX ..."


Now, I don't really know how to wait for all child processes in a bash script (I could give you the python for it, but you're a perl guy). But by not spawning the last INDEX, I'm hoping it takes longer than the rest. Try to put the most difficult index there.


Then you could just run

time loadscript.sh

I'm sure you could do the equivalent in perl. Just open multiple connections to the DB, and have them ready.

I'm guessing since you are on a dual processor machine, you won't get much better performance above 2 connections.

You can also try doing 2 COPYs at the same time, but it seems like you would have issues. Do you have any serial columns that you expect to be in a certain order, or is all the information in the copy?

If the latter, try it, let us know what you get. I can't tell you the perl for this, since I'm not a perl guy.

John
=:->




---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Reply via email to