It is. Maybe we are anomalous here.
> I don't think I buy that claim. We don't normally fsync the log file
> except at transaction commit (and read-only transactions don't generate
> any commit record, so they don't cause an fsync). If a single
> transaction is generating lots of log data, it doesn't have to wait for
> that data to hit disk before it can do more stuff.
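For what it's worth, the pattern Tom describes -- buffered log writes with a single fsync() at commit -- looks roughly like this sketch (file name and record size here are just illustrative, not what Pg actually uses):

```python
import os

def append_wal(fd, records):
    # Each record goes into the kernel's buffer cache; no fsync per write.
    for rec in records:
        os.write(fd, rec)

def commit(fd):
    # Only the commit forces the accumulated log data to durable storage.
    os.fsync(fd)

fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
append_wal(fd, [b"x" * 8192 for _ in range(100)])  # 100 x 8 KB of log data
commit(fd)  # one fsync for the whole transaction, not one per 8 KB write
os.close(fd)
```

Under that model a big transaction pays for one sync at the end, which is why per-write fsync traffic like I saw would be surprising.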
I have glanced at the code, and I agree that reads do not generate fsync() calls. Still, having watched my mirrored RAID 5 arrays hit 2,000 iops with an average request size of 4 KB during a recent batch update under Pg, I suspect that Pg may be fsync()-ing a bit too often.
Since I haven't digested all of the code, I am speaking a bit out of turn.
> But having said that --- on some platforms our default WAL sync method
> is open_datasync, which could result in the sort of behavior you are
> talking about. Try experimenting with the other possible values of
> wal_sync_method to see if you like them better.
I will have to check that. I am running Sparc Solaris 8.
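For anyone following along, the setting lives in postgresql.conf; which values are accepted depends on the platform and build, so the snippet below is just a sketch (on Solaris, fdatasync or fsync seem the likely candidates to try):

```
# postgresql.conf
# possible values: open_datasync, fdatasync, fsync, open_sync (platform-dependent)
wal_sync_method = fdatasync
```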
> That probably gets you into a situation where no I/O is really happening
> at all, it's just being absorbed by kernel disk buffers.
Few things would please me more.
> Unfortunately that doesn't have a lot to do with the performance you can
> get if you want to be sure you don't lose data ...
I am not sure the two are as mutually exclusive as they look here.
> BTW, one thing you can do to reduce the WAL I/O volume in Postgres is
> to increase the inter-checkpoint interval (there are two settings to
> increase, one time-based and one volume-based). The first write of a
> given data page after a checkpoint dumps the whole page into WAL, as a
> safety measure to deal with partial page writes during power failures.
> So right after a checkpoint the WAL volume goes way up. With a longer
> interval between checkpoints you don't pay that price as often.
I did that and it helped tremendously. Without proper tuning, I just made the numbers pretty large:
shared_buffers = 100000
sort_mem = 131072
vacuum_mem = 65536
wal_buffers = 8192
checkpoint_segments = 32
Thanks for your feedback.
Sincerely, Marty
Tom Lane wrote:
Marty Scholes <[EMAIL PROTECTED]> writes:
> My experience with Oracle (and now limited experience with Pg) is that
> the major choke point in performance is not the CPU or read I/O, it is
> the log performance of big update and select statements.
If your load is primarily big update statements, maybe so...
> Essentially, the data is written twice: first to the log and then the
> data files. This would be ok except the transaction is regularly frozen
> while the log files sync to disk with a bunch of tiny (8KB for Oracle
> and Pg) write requests.
I don't think I buy that claim. We don't normally fsync the log file except at transaction commit (and read-only transactions don't generate any commit record, so they don't cause an fsync). If a single transaction is generating lots of log data, it doesn't have to wait for that data to hit disk before it can do more stuff.
But having said that --- on some platforms our default WAL sync method is open_datasync, which could result in the sort of behavior you are talking about. Try experimenting with the other possible values of wal_sync_method to see if you like them better.
> If a transaction will do large updates or inserts, why don't we just
> log the parsed statements in the WAL instead of the individual data
> blocks that have changed?
As already pointed out, this would not give enough information to reproduce the database state.
> Some informal testing suggests that we get a factor of 8 improvement in
> speed here if we completely disable fsync() in large updates under Pg.
That probably gets you into a situation where no I/O is really happening at all, it's just being absorbed by kernel disk buffers. Unfortunately that doesn't have a lot to do with the performance you can get if you want to be sure you don't lose data ...
BTW, one thing you can do to reduce the WAL I/O volume in Postgres is to increase the inter-checkpoint interval (there are two settings to increase, one time-based and one volume-based). The first write of a given data page after a checkpoint dumps the whole page into WAL, as a safety measure to deal with partial page writes during power failures. So right after a checkpoint the WAL volume goes way up. With a longer interval between checkpoints you don't pay that price as often.
regards, tom lane
---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]