My experience with Oracle (and now limited experience with Pg) is that the major choke point in performance is not the CPU or read I/O; it is the log performance of big update and select statements.
Essentially, the data is written twice: first to the log and then to the data files. That would be OK, except that the transaction is regularly frozen while the log files sync to disk in a bunch of tiny (8KB for Oracle and Pg) write requests.
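To make the pattern concrete, here is a rough C sketch of the commit path I mean (my own toy code, not the actual Oracle or Pg implementation):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define LOG_BLOCK 8192  /* the 8KB log block size used by Oracle and Pg */

    /* Append one commit record, padded to a full log block, then wait for
     * the disk: the calling transaction is frozen inside fsync(). */
    static int log_commit(int log_fd, const char *record, size_t len)
    {
        char block[LOG_BLOCK] = {0};
        memcpy(block, record, len < sizeof block ? len : sizeof block);
        if (write(log_fd, block, sizeof block) != (ssize_t) sizeof block)
            return -1;
        return fsync(log_fd);  /* this is where big updates stall */
    }

    int main(void)
    {
        int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0) { perror("open"); return 1; }
        const char *rec = "COMMIT txid 42";
        if (log_commit(fd, rec, strlen(rec)) != 0) perror("log_commit");
        return close(fd);
    }

Every sync pays for a full block write plus a disk seek, no matter how little actually changed.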
I realize that the logs must be there to ensure crash recovery, and that PITR (point-in-time recovery) is on the way to supplement this.
If a transaction will do large updates or inserts, why don't we just log the parsed statements in the WAL instead of the individual data blocks that have changed? The data files could then be fsync()ed at every checkpoint, but essentially no write I/O would take place in the interim.
Outside of checkpoints, large updates would only need to fsync() a very small addition to the log files.
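In code, the idea amounts to something like this (hypothetical helper functions, not existing Pg APIs; error handling omitted):

    #include <string.h>
    #include <unistd.h>

    /* Log the SQL text of a large statement instead of every changed block:
     * the only synchronous write is a tiny append to the WAL. */
    static void log_statement(int wal_fd, const char *sql)
    {
        write(wal_fd, sql, strlen(sql) + 1);  /* a few hundred bytes, not MBs */
        fsync(wal_fd);                        /* the only sync between checkpoints */
    }

    /* At each checkpoint, flush the accumulated data-page changes in bulk;
     * WAL older than the checkpoint can then be recycled. */
    static void checkpoint(int data_fd)
    {
        fsync(data_fd);
    }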
Recovery could then work much as I understand it does today:
1. Roll back in-flight changes.
2. Roll forward the log entries in order, either applying the direct data changes or re-executing the parsed command stored in the log (sketched below).
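A toy replay loop for such a mixed log might look like the following (the record format and both helper functions are made up for illustration):

    #include <stdio.h>

    typedef enum { REC_BLOCK, REC_SQL } rec_type;

    typedef struct {
        rec_type type;
        long     block_no;        /* valid only for REC_BLOCK */
        char     payload[8192];   /* block image, or NUL-terminated SQL */
    } log_rec;

    /* Stand-ins for the real storage and executor entry points. */
    static void apply_block(long block_no, const char *image)
    {
        printf("restore block %ld from logged image\n", block_no);
        (void) image;
    }

    static void execute_sql(const char *sql)
    {
        printf("re-execute: %s\n", sql);
    }

    /* Roll forward in log order, mixing physical and logical records. */
    static void replay(const log_rec *recs, int n)
    {
        for (int i = 0; i < n; i++) {
            if (recs[i].type == REC_BLOCK)
                apply_block(recs[i].block_no, recs[i].payload);
            else
                execute_sql(recs[i].payload);
        }
    }

    int main(void)
    {
        log_rec demo[2] = {
            { REC_BLOCK, 17, "...8KB page image..." },
            { REC_SQL, 0, "UPDATE accounts SET balance = balance * 1.01" }
        };
        replay(demo, 2);
        return 0;
    }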
Some informal testing suggests we get a factor-of-8 improvement in speed here if we completely disable fsync() during large updates under Pg.
I would suspect that a big portion of those gains would be preserved if fsync() calls were limited to checkpoints, with the parsed SQL command saved in the log instead.
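For reference, the informal test just used Pg's existing fsync setting in postgresql.conf (only safe for throwaway data, since a crash can corrupt the cluster):

    # postgresql.conf -- disables all fsync() calls; unsafe outside testing
    fsync = off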
Why have I not seen this in any database?
There must be a reason.
Thanks in advance.
Sincerely, Marty