Tom Lane wrote:
> Kevin Brown <[EMAIL PROTECTED]> writes:
> > So the backends have to keep a common list of all the files they
> > touch.  Admittedly, that could be a problem if it means using a bunch
> > of shared memory, and it may have additional performance implications
> > depending on the implementation ...
> It would have to be a list of all files that have been touched since the
> last checkpoint.  That's a serious problem for storage in shared memory,
> which is by definition fixed-size.

Of course, the file list needn't be stored in SysV shared memory.  It
could be stored in a file that's later read by the checkpointing
process.  The backends could serialize their writes via fcntl() or
flock() style locks, whichever is appropriate.  Locking might even be
avoided entirely if the list file is opened with O_APPEND and the
individual writes are small enough to be atomic.

> Right.  "Portably" was the key word in my comment (sorry for not
> emphasizing this more clearly).  The real problem here is how to know
> what is the actual behavior of each platform?  I'm certainly not
> prepared to trust reading-between-the-lines-of-some-man-pages.  

Reading between the lines isn't necessarily required, just literal
interpretation.  :-)

> And I can't think of a simple yet reliable direct test.  You'd
> really have to invest detailed study of the kernel source code to
> know for sure ...  and many of our platforms don't have open-source
> kernels.

Linux appears to do the right thing with the file data itself, even if
it doesn't handle the directory entry simultaneously.  Others have
claimed, in messages to pgsql-general and elsewhere (found via a
Google search), that FreeBSD definitely does the right thing.

I certainly agree that non-open-source kernels are uncertain.  That's
why it wouldn't be a bad idea to control this via a GUC variable.

> > Under Linux (and perhaps HP-UX), it may be necessary to fsync() the
> > directories leading to the file as well, so that the state of the
> > filesystem on disk is consistent and safe in the event that the files
> > in question are newly-created.
> AFAIK, all Unix implementations are paranoid about consistency of
> filesystem metadata, including directory contents.  

Not ext2 under Linux!  By default, it writes everything
asynchronously.  I don't know how many people use ext2 to do serious
tasks under Linux, so this may not be that much of an issue.

> So fsync'ing directories from a user process strikes me as a waste
> of time, even assuming that it were portable, which I doubt.  What
> we need to worry about is whether fsync'ing a bunch of our own data
> files is a practical substitute for a global sync() call.

I'm positive that under certain operating systems, fsyncing the data
is a better option than a global sync(), especially since sync() isn't
guaranteed to wait until the buffers are flushed.  Right now the state
of the data on disk immediately after a checkpoint is just a guess
because of that.  I don't see that using fsync() would introduce
significantly more uncertainty on systems where the manpage explicitly
says that the buffers associated with the file referenced by the file
descriptor are the ones written to disk.  For instance, the FreeBSD
manpage says:

    Fsync() causes all modified data and attributes of fd to be moved
    to a permanent storage device.  This normally results in all
    in-core modified copies of buffers for the associated file to be
    written to a disk.

    Fsync() should be used by programs that require a file to be in a
    known state, for example, in building a simple transaction
    facility.

and the Linux manpage says:

    fsync copies all in-core parts of a file to disk, and waits until
    the device reports that all parts are on stable storage.  It also
    updates metadata stat information.  It does not necessarily ensure
    that the entry in the directory containing the file has also
    reached disk.  For that an explicit fsync on the file descriptor
    of the directory is also needed.

Both are rather unambiguous, and a cursory review of the Linux source
confirms what its manpage says, at least.  The FreeBSD manpage might
be ambiguous, but the fact that they also have an fsync command line
utility essentially proves that FreeBSD's fsync() flushes all buffers
associated with the file.

Conversely, the Solaris manpage says:

    The fsync() function moves all modified data and attributes of the
    file descriptor fildes to a storage device. When fsync() returns,
    all in-memory modified copies of buffers associated with fildes
    have been written to the physical medium.

It's pretty clear from the Solaris description that its fsync()
concerns itself only with the buffers associated with a file
descriptor and not with the file itself.  The fact that it's
implemented as a library call (the manpage is in section 3 instead of
section 2) convinces me further that its fsync() implementation is as
limited as its manpage suggests.

The PostgreSQL default for checkpoints should probably be sync(), but
I think fsync() should be an available option, just as it's already
possible to control both whether synchronous writes are used for the
transaction log and which synchronization mechanism is used for it.
Yes, it's another parameter for the administrator to concern himself
with, but it seems to me that a significant amount of speed could be
gained under certain (perhaps quite common) circumstances with such a
mechanism.
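Something along these lines, say (the setting name below is invented
purely to illustrate the shape of the knob; no such GUC exists):

```
# Hypothetical postgresql.conf fragment -- illustrative only.
checkpoint_sync_method = sync   # 'sync'  = global sync() at checkpoint
                                # 'fsync' = per-file fsync of touched files
```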

Kevin Brown                                           [EMAIL PROTECTED]
