On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I think what we ought to do is bite the bullet and refactor the
> representation of the pendingOps table.  What I'm thinking about
> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
> so that there's one hashtable entry per fork, and then storing a
> bitmap to indicate which segment numbers need to be sync'd.  At
> one gigabyte to the bit, I think we could expect the bitmap would
> not get terribly large.  We'd still have a "cancel" flag in each
> hash entry, but it'd apply to the whole relation fork not each
> segment.

I think this is a good idea.

>> Also, I still wonder if it is worth memorizing fsyncs (under
>> fsync=off) that may or may not ever take place.  Is there any
>> guarantee that we can make by doing so, that couldn't be made
>> otherwise?
> Yeah, you have a point there.  It's not real clear that switching fsync
> from off to on is an operation that we can make any guarantees about,
> short of executing something like the code recently added to initdb
> to force-sync the entire PGDATA tree.  Perhaps we should change fsync
> to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
> skip forwarding fsync requests when it's off?

I am emphatically opposed to making fsync PGC_POSTMASTER.  Being able
to change parameters on the fly without having to shut down the system
is important, and we should be looking for ways to make it possible to
change more things on-the-fly, not arbitrarily restricting GUCs that
already exist.  This is certainly one I've changed on the fly, and I'm
willing to bet there are real-world users out there who have done the
same (e.g. to survive an unexpected load spike).

I would argue that such a change adds no measure of safety, anyway.
Suppose we have the following sequence of events, starting with fsync=off:

T0: write
T1: checkpoint (fsync of T0 skipped since fsync=off)
T2: write
T3: fsync=on
T4: checkpoint (fsync of T2 performed)

Why is it OK to fsync the write at T2 but not the one at T0?  In order
for the system to become crash-safe, the user will need to guarantee,
at some point following T3, that the entire OS buffer cache has been
flushed to disk.  Whether or not the fsync of T2 happened is
irrelevant.  Had we chosen not to send an fsync request at all at time
T2, the user's obligations following T3 would be entirely unchanged.
Thus, I see no reason why we need to restrict the fsync setting in
order to implement the proposed optimization.

But, at a broader level, I am not very excited about this
optimization.  It seems to me that if this is hurting enough to be
noticeable, then it's hurting us when fsync=on as well, and we ought
to think a little harder about how to cut down on the IPC overhead.
If the bgwriter comm lock is contended, we could partition it - e.g.
by giving each backend a small queue protected by the backendLock,
which is flushed into the main queue when it fills and harvested by
the bgwriter once per checkpoint cycle.  (This is the same principle
as the fast-path locking stuff that we used to eliminate lmgr
contention on short read-only queries in 9.2.)  If we only fix it for
the fsync=off case, then what about people who are running with
fsync=on but have extremely fast fsyncs?  Most of us probably don't
have the hardware to test that today but it's certainly out there and
will probably become more common in the future.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
