On Sun, Jul 15, 2012 at 2:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I think what we ought to do is bite the bullet and refactor the
> representation of the pendingOps table.  What I'm thinking about
> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
> so that there's one hashtable entry per fork, and then storing a
> bitmap to indicate which segment numbers need to be sync'd.  At
> one gigabyte to the bit, I think we could expect the bitmap would
> not get terribly large.  We'd still have a "cancel" flag in each
> hash entry, but it'd apply to the whole relation fork not each
> segment.
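For illustration, the refactoring described above could be modeled roughly as follows. This is a hypothetical Python sketch, not the actual C structures; all names (PendingFork, remember_fsync_request, etc.) are invented for the example, and the real hash key would be the RelFileNodeBackend + ForkNumber pair:

```python
# Toy model of the proposed pendingOps representation: one entry per
# (relfilenode, fork), holding a bitmap of 1GB segment numbers that still
# need fsync, plus a single "cancel" flag covering the whole relation fork.

class PendingFork:
    def __init__(self):
        self.segment_bitmap = 0   # bit N set => segment N needs fsync
        self.canceled = False     # applies to the entire fork, not per segment

pending_ops = {}  # (relfilenode, fork_number) -> PendingFork

def remember_fsync_request(rnode, fork, segno):
    entry = pending_ops.setdefault((rnode, fork), PendingFork())
    entry.segment_bitmap |= 1 << segno

def forget_relation_fork(rnode, fork):
    # A cancel now wipes every pending segment of the fork at once.
    entry = pending_ops.get((rnode, fork))
    if entry is not None:
        entry.segment_bitmap = 0
        entry.canceled = True

def segments_to_sync(rnode, fork):
    entry = pending_ops.get((rnode, fork))
    if entry is None or entry.canceled:
        return []
    bitmap, segs, segno = entry.segment_bitmap, [], 0
    while bitmap:
        if bitmap & 1:
            segs.append(segno)
        bitmap >>= 1
        segno += 1
    return segs
```

At one bit per gigabyte segment, even a 1TB relation fork needs only a 1024-bit bitmap, which is the sense in which the bitmap "would not get terribly large."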
I think this is a good idea.

>> Also, I still wonder if it is worth memorizing fsyncs (under
>> fsync=off) that may or may not ever take place.  Is there any
>> guarantee that we can make by doing so, that couldn't be made
>> otherwise?
>
> Yeah, you have a point there.  It's not real clear that switching fsync
> from off to on is an operation that we can make any guarantees about,
> short of executing something like the code recently added to initdb
> to force-sync the entire PGDATA tree.  Perhaps we should change fsync
> to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
> skip forwarding fsync requests when it's off?

I am emphatically opposed to making fsync PGC_POSTMASTER. Being able to change parameters on the fly without having to shut down the system is important, and we should be looking for ways to make it possible to change more things on the fly, not arbitrarily restricting GUCs that already exist. This is certainly one I've changed on the fly, and I'm willing to bet there are real-world users out there who have done the same (e.g. to survive an unexpected load spike).

I would argue that such a change adds no measure of safety, anyway. Suppose we have the following sequence of events, starting with fsync=off:

T0: write
T1: checkpoint (fsync of T0 skipped since fsync=off)
T2: write
T3: fsync=on
T4: checkpoint (fsync of T2 performed)

Why is it OK to fsync the write at T2 but not the one at T0? In order for the system to become crash-safe, the user will need to guarantee, at some point following T3, that the entire OS buffer cache has been flushed to disk. Whether or not the fsync of T2 happened is irrelevant. Had we chosen not to send an fsync request at all at time T2, the user's obligations following T3 would be entirely unchanged. Thus, I see no reason why we need to restrict the fsync setting in order to implement the proposed optimization.

But, at a broader level, I am not very excited about this optimization.
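The T0-T4 argument can be sketched as a toy model (hypothetical Python, not PostgreSQL code; the file names and the checkpoint/crash behavior are simplified for illustration). Whether or not the T2 fsync request is forwarded, the T0 write is lost on a crash either way:

```python
# Toy model of the T0-T4 timeline: "cache" holds writes sitting in the OS
# buffer cache, "disk" holds durable writes, and "pending" is a simplified
# fsync-request queue that a checkpoint consumes whether or not fsync is on.

def run_timeline(forward_request_at_t2):
    cache, disk, pending = set(), set(), set()

    def write(f, forward=True):
        cache.add(f)
        if forward:
            pending.add(f)  # fsync request forwarded for this file

    def checkpoint(fsync_on):
        for f in list(pending):
            if fsync_on and f in cache:
                cache.discard(f)
                disk.add(f)
        pending.clear()  # requests are consumed even when fsync is off

    write("file_T0")                                  # T0: write
    checkpoint(fsync_on=False)                        # T1: fsync of T0 skipped
    write("file_T2", forward=forward_request_at_t2)   # T2: write
    # T3: fsync=on
    checkpoint(fsync_on=True)                         # T4: fsync of T2 (maybe)
    cache.clear()                                     # crash: OS cache is lost
    return disk                                       # what survived the crash
```

In both runs, file_T0 never reaches disk, so the system is not crash-safe after T3 regardless of what happened to file_T2; the user's obligation to force-flush everything is identical in either case.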
It seems to me that if this is hurting enough to be noticeable, then it's hurting us when fsync=on as well, and we had better think a little harder about how to cut down on the IPC overhead. If the bgwriter comm lock is contended, we could partition it - e.g. by giving each backend a small queue protected by the backendLock, which is flushed into the main queue when it fills and harvested by the bgwriter once per checkpoint cycle. (This is the same principle as the fast-path locking stuff that we used to eliminate lmgr contention on short read-only queries in 9.2.)

If we only fix it for the fsync=off case, then what about people who are running with fsync=on but have extremely fast fsyncs? Most of us probably don't have the hardware to test that today, but it's certainly out there and will probably become more common in the future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers