Jan Wieck <[EMAIL PROTECTED]> writes:
Or maybe fdatasync() would be slightly more efficient - do we care about flushing metadata that much?
What still needs to be addressed is the IO storm caused by checkpoints. I see it much relaxed when stretching out the BufferSync() over most of the time until the next one should occur. But the kernel sync at its end still pushes the system hard against the wall.
I have never been happy with the fact that we use sync(2) at all. Quite aside from the "I/O storm" issue, sync() is really an unsafe way to do a checkpoint, because there is no way to be certain when it is done. And on top of that, it does too much, because it forces syncing of files unrelated to Postgres.
I would like to see us go over to fsync, or some other technique that gives more certainty about when the write has occurred. There might be some scope that way to allow stretching out the I/O, too.
The main problem with this is knowing which files need to be fsync'd. The only idea I have come up with is to move all buffer write operations into a background writer process, which could easily keep track of every file it's written into since the last checkpoint. This could cause problems though if a backend wants to acquire a free buffer and there's none to be had --- do we want it to wait for the background process to do something? We could possibly say that backends may write dirty buffers for themselves, but only if they fsync them immediately. As long as this path is seldom taken, the extra fsyncs shouldn't be a big performance problem.
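To make the bookkeeping idea concrete, here is a rough sketch in C of a writer that remembers which files it has written since the last checkpoint and fsyncs only those, plus the fallback path where a backend writes a buffer itself and syncs it right away. Nothing here is taken from the Postgres source: DirtyFileSet, MAX_DIRTY_FILES, and all the function names are hypothetical, and a real implementation would track files by something more durable than a raw descriptor in a fixed-size array.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_DIRTY_FILES 64          /* hypothetical limit, just for the sketch */

/* Hypothetical bookkeeping: descriptors written since the last checkpoint. */
typedef struct {
    int fds[MAX_DIRTY_FILES];
    int count;
} DirtyFileSet;

static void dirty_set_init(DirtyFileSet *set)
{
    set->count = 0;
}

/* Record fd once; returns 0 on success, -1 if the set is full. */
static int dirty_set_add(DirtyFileSet *set, int fd)
{
    assert(set->count >= 0 && set->count <= MAX_DIRTY_FILES);
    for (int i = 0; i < set->count; i++)
        if (set->fds[i] == fd)
            return 0;               /* already tracked */
    if (set->count == MAX_DIRTY_FILES)
        return -1;
    set->fds[set->count++] = fd;
    return 0;
}

/* Background-writer path: write the buffer, remember the file for later. */
static int writer_write(DirtyFileSet *set, int fd,
                        const void *buf, size_t len, off_t offset)
{
    ssize_t n = pwrite(fd, buf, len, offset);
    if (n < 0 || (size_t) n != len)
        return -1;
    return dirty_set_add(set, fd);
}

/* Checkpoint: fsync each tracked file, then forget them.
 * Returns the number of files synced, or -1 on the first failure. */
static int writer_checkpoint(DirtyFileSet *set)
{
    int synced = 0;
    for (int i = 0; i < set->count; i++) {
        if (fsync(set->fds[i]) != 0)
            return -1;              /* set is left intact so the caller can retry */
        synced++;
    }
    set->count = 0;
    return synced;
}

/* Backend fallback path: write the buffer and fsync it immediately,
 * so the checkpoint never has to know about this write. */
static int backend_write_and_sync(int fd, const void *buf,
                                  size_t len, off_t offset)
{
    ssize_t n = pwrite(fd, buf, len, offset);
    if (n < 0 || (size_t) n != len)
        return -1;
    return fsync(fd);
}
```

The point of the sketch is only that the checkpoint knows exactly which files to wait on and gets a return code for each, which sync() cannot give us.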
Actually, once you build it this way, you could make all writes synchronous (open the files O_SYNC) so that there is never any need for explicit fsync at checkpoint time. The background writer process would be the one incurring the wait in most cases, and that's just fine. In this way you could directly control the rate at which writes are issued, and there's no I/O storm at all. (fsync could still cause an I/O storm if there are lots of pending writes in a single file.)
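A minimal sketch of the O_SYNC variant, again in C and again with made-up names (open_sync, paced_sync_write, SKETCH_PAGE_SIZE, and the delay parameter are all assumptions for illustration). Each pwrite() on an O_SYNC descriptor should not return until the data has been handed to stable storage, so the pause between writes is what sets the I/O rate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192       /* hypothetical page size for the sketch */

/* Open a data file so that every write is synchronous. */
static int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}

/* Write npages pages one at a time, sleeping delay_ms between writes.
 * Returns the number of pages written, or -1 on a failed or short write. */
static int paced_sync_write(int fd, const char *pages, int npages, long delay_ms)
{
    assert(npages >= 0 && delay_ms >= 0);
    struct timespec pause = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    for (int i = 0; i < npages; i++) {
        ssize_t n = pwrite(fd, pages + (size_t) i * SKETCH_PAGE_SIZE,
                           SKETCH_PAGE_SIZE, (off_t) i * SKETCH_PAGE_SIZE);
        if (n != SKETCH_PAGE_SIZE)
            return -1;
        /* The write above has already waited for the disk; the sleep
         * spreads the remaining writes out over time. */
        if (delay_ms > 0 && i + 1 < npages)
            nanosleep(&pause, NULL);
    }
    return npages;
}
```

With this shape there is nothing left to flush at checkpoint time; the cost is that whoever issues the write waits for the disk on every page, which is acceptable as long as that is usually the background writer.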
cheers
andrew