Tom Lane wrote:
> Kevin Brown <[EMAIL PROTECTED]> writes:
> > So the backends have to keep a common list of all the files they
> > touch.  Admittedly, that could be a problem if it means using a bunch
> > of shared memory, and it may have additional performance implications
> > depending on the implementation ...
>
> It would have to be a list of all files that have been touched since
> the last checkpoint.  That's a serious problem for storage in shared
> memory, which is by definition fixed-size.
Of course, the file list needn't be stored in SysV shared memory.  It
could be stored in a file that's later read by the checkpointing
process.  The backends could serialize their writes via fcntl() or
flock() style locks, whichever is appropriate.  Locking might even be
avoided entirely if each record is appended with a single small write()
on a descriptor opened with O_APPEND.

> Right.  "Portably" was the key word in my comment (sorry for not
> emphasizing this more clearly).  The real problem here is how to know
> what is the actual behavior of each platform?  I'm certainly not
> prepared to trust reading-between-the-lines-of-some-man-pages.

Reading between the lines isn't necessarily required, just literal
interpretation.  :-)

> And I can't think of a simple yet reliable direct test.  You'd
> really have to invest detailed study of the kernel source code to
> know for sure ... and many of our platforms don't have open-source
> kernels.

Linux appears to do the right thing with the file data itself, even if
it doesn't handle the directory entry at the same time.  Others claim,
in messages written to pgsql-general and elsewhere (found via a Google
search), that FreeBSD definitely does the right thing.

I certainly agree that non-open-source kernels are uncertain.  That's
why it wouldn't be a bad idea to make this behavior controllable via a
GUC variable.

> > Under Linux (and perhaps HP-UX), it may be necessary to fsync() the
> > directories leading to the file as well, so that the state of the
> > filesystem on disk is consistent and safe in the event that the
> > files in question are newly created.
>
> AFAIK, all Unix implementations are paranoid about consistency of
> filesystem metadata, including directory contents.

Not ext2 under Linux!  By default, it writes everything, metadata
included, asynchronously.  I don't know how many people use ext2 for
serious work under Linux, though, so this may not be much of an issue.

> So fsync'ing directories from a user process strikes me as a waste
> of time, even assuming that it were portable, which I doubt.
> What we need to worry about is whether fsync'ing a bunch of our own
> data files is a practical substitute for a global sync() call.

I'm positive that on certain operating systems, fsync()ing the data
files is a better option than a global sync(), especially since sync()
isn't guaranteed to wait until the buffers have actually been flushed.
Right now, because of that, the state of the data on disk immediately
after a checkpoint is just a guess.

I don't see that using fsync() would introduce significantly more
uncertainty on systems whose manpages explicitly say that the buffers
written to disk are the ones associated with the file referenced by the
file descriptor.  For instance, the FreeBSD manpage says:

     Fsync() causes all modified data and attributes of fd to be moved
     to a permanent storage device.  This normally results in all
     in-core modified copies of buffers for the associated file to be
     written to a disk.

     Fsync() should be used by programs that require a file to be in a
     known state, for example, in building a simple transaction
     facility.

and the Linux manpage says:

     fsync copies all in-core parts of a file to disk, and waits until
     the device reports that all parts are on stable storage.  It also
     updates metadata stat information.  It does not necessarily ensure
     that the entry in the directory containing the file has also
     reached disk.  For that an explicit fsync on the file descriptor
     of the directory is also needed.

Both are rather unambiguous, and a cursory review of the Linux source
confirms what its manpage says, at least.  The FreeBSD manpage might be
ambiguous, but the fact that FreeBSD also ships an fsync command-line
utility is strong evidence that its fsync() flushes all buffers
associated with the file.  Conversely, the Solaris manpage says:

     The fsync() function moves all modified data and attributes of the
     file descriptor fildes to a storage device.
     When fsync() returns, all in-memory modified copies of buffers
     associated with fildes have been written to the physical medium.

It's pretty clear from the Solaris description that its fsync()
concerns itself only with the buffers associated with the file
descriptor, not with the file as a whole.  The fact that it's
implemented as a library call (the manpage is in section 3 rather than
section 2) convinces me further that its fsync() implementation is as
described.

The PostgreSQL default for checkpoints should probably stay sync(), but
I think fsync() should be an available option, just as it's already
possible to control whether synchronous writes are used for the
transaction log, as well as which synchronization mechanism is used for
it.  Yes, it's another parameter for the administrator to concern
himself with, but it seems to me that a significant amount of speed
could be gained under certain (perhaps quite common) circumstances with
such a mechanism.

--
Kevin Brown                                           [EMAIL PROTECTED]

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly