It appears the fsync problem is pervasive. Here's Linux 2.4.19's
version from fs/buffer.c:
    down(&inode->i_sem);    /* sys_fsync() takes the inode semaphore,
                               locking out concurrent writers... */
    ret = filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 1);
    if (err && !ret)
            ret = err;
    err = filemap_fdatawait(inode->i_mapping);
    if (err && !ret)
            ret = err;
    up(&inode->i_sem);      /* ...for the whole duration of the flush */
But this is probably not a big factor, as you outline below, because
the WALWriteLock is causing the same kind of contention.
Tom Lane wrote:
> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)
I hope you're right but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.
Assume transaction A writes a lot of buffers and XLog entries, so its
commit forces a relatively lengthy fsync.
Transactions B through E then block, not on the kernel lock from
fsync, but on the WALWriteLock itself.
When A finishes the fsync and subsequently releases the WALWriteLock,
B unblocks and takes the WALWriteLock for the fsync of its own flush.
C blocks on the WALWriteLock, waiting to write its XLOG_XACT_COMMIT.
B releases, and now C writes its XLOG_XACT_COMMIT.
There now seems to be a lot of contention on the WALWriteLock. This
is a shame for a system that has no locking at the logical level and
therefore seems like it could be very, very fast and offer excellent
concurrency.
> As I commented before, I think we could do with an extra process to
> issue WAL writes in places where they're not in the critical path for
> a foreground process. But that seems to be orthogonal from this issue.
It's only orthogonal to the fsync-specific contention issue. We now
have to worry about the WALWriteLock semantics causing the same
contention.
Your idea of a separate LogWriter process could very nicely solve this
problem and accomplish a few other things at the same time if we make
a few enhancements.
Back-end servers would not issue fsync calls. They would simply block
until the LogWriter had written their record to disk, i.e. until the
sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed
back-ends after its log write returns.
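The wake-up protocol might look something like this (a minimal sketch
with invented names, not actual PostgreSQL code): back-ends sleep on a
condition variable, and the LogWriter broadcasts after each completed
write, so any number of committed back-ends wake on a single flush:

```c
/* Minimal sketch, names invented: a back-end blocks until the
 * LogWriter's flushed position passes its commit record, instead
 * of issuing its own fsync(). */
#include <pthread.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static unsigned long   flushed_upto = 0;  /* last log position on disk */

/* Back-end side: called after copying the XLOG_XACT_COMMIT record
 * into the log buffer; 'commit_pos' is the end of that record. */
void wait_for_commit(unsigned long commit_pos)
{
    pthread_mutex_lock(&flush_lock);
    while (flushed_upto < commit_pos)
        pthread_cond_wait(&flush_done, &flush_lock);
    pthread_mutex_unlock(&flush_lock);
}

/* LogWriter side: called after an O_DSYNC write returns; every
 * back-end whose commit record is now on disk wakes at once. */
void log_write_finished(unsigned long new_pos)
{
    pthread_mutex_lock(&flush_lock);
    flushed_upto = new_pos;
    pthread_cond_broadcast(&flush_done);
    pthread_mutex_unlock(&flush_lock);
}
```

One broadcast waking every waiter whose record made it into the write
is exactly the group-commit behavior we want.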
The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present or
of smaller chunks if enough time had elapsed since the last write.
The nice part is that the WALWriteLock semantics could be changed to
allow the LogWriter to write to disk while WALWriteLocks are acquired
by back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would
only need to grab a lock to determine the current end of the log
buffer. Since it would be writing blocks that occur earlier in the
cache than the XLogInsert log writers, it wouldn't need to grab a
WALWriteLock before writing the cache buffers.
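Roughly like this (again with invented names, and ignoring buffer
wraparound for simplicity):

```c
/* Sketch of the relaxed locking: back-ends hold the insert lock only
 * for the memcpy; the LogWriter takes it only to learn the current
 * end of valid data, then writes the earlier blocks with no lock
 * held, since back-ends only ever append beyond that point. */
#include <string.h>
#include <stddef.h>
#include <pthread.h>

#define LOGBUF_SIZE 65536

static char   logbuf[LOGBUF_SIZE];
static size_t insert_pos = 0;                   /* end of valid data */
static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;

/* Back-end side: no I/O happens while the lock is held. */
size_t xlog_insert(const void *rec, size_t len)
{
    pthread_mutex_lock(&insert_lock);
    size_t pos = insert_pos;
    memcpy(logbuf + pos, rec, len);
    insert_pos = pos + len;
    pthread_mutex_unlock(&insert_lock);
    return pos + len;      /* the position a commit must wait for */
}

/* LogWriter side: grab the lock just long enough to snapshot how
 * far the buffer is valid; the write() itself runs lock-free. */
size_t snapshot_end(void)
{
    pthread_mutex_lock(&insert_lock);
    size_t end = insert_pos;
    pthread_mutex_unlock(&insert_lock);
    return end;
}
```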
Many transactions would commit on the same fsync (now really a write
with O_DSYNC) and we would get optimal write throughput for the log.
This would handle all the issues I had and it doesn't sound like a
huge change. In fact, it ends up being almost semantically identical
to the aio_write suggestion I made originally, except the
LogWriter is doing the background writing instead of the OS and we
don't have to worry about aio implementations and portability.