On Thu, Jul 19, 2012 at 2:57 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Robert Haas <robertmh...@gmail.com> writes: >> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> What about checking just the immediately previous entry? This would >>> at least fix the problem for bulk-load situations, and the cost ought >>> to be about negligible compared to acquiring the LWLock. > >> 2. You say "fix the problem" but I'm not exactly clear what problem >> you think this fixes. > > What I'm concerned about is that there is going to be a great deal more > fsync request queue traffic in 9.2 than there ever was before, as a > consequence of the bgwriter/checkpointer split. The design expectation > for this mechanism was that most fsync requests would be generated > locally inside the bgwriter and thus go straight into the hash table > without having to go through the shared-memory queue. I admit that > we have seen no benchmarks showing that there's a problem, but that's > because up till yesterday the bgwriter was failing to transmit such > messages at all. So I'm looking for ways to cut the overhead. > > But having said that, maybe we should not panic until we actually see > some benchmarks showing the problem.
+1 for not panicking. I'm prepared to believe that there could be a problem here, but I'm not prepared to believe that we've characterized it well enough to be certain that any changes we choose to make will make things better not worse. > Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC. > I have been looking at the two-hash-tables design I suggested before, > and realized that there's a timing issue: if we just stuff "forget" > requests into a separate table, there is no method for determining > whether a given fsync request arrived before or after a given forget > request. This is problematic if the relfilenode gets recycled: we > need to be able to guarantee that a previously-posted forget request > won't cancel a valid fsync for the new relation. I believe this is > soluble though, if we merge the "forget" requests with unlink requests, > because a relfilenode can't be recycled until we do the unlink. > So as far as the code goes: > > 1. Convert the PendingUnlinkEntry linked list to a hash table keyed by > RelFileNode. It acts the same as before, and shouldn't be materially > slower to process, but now we can determine in O(1) time whether there > is a pending unlink for a relfilenode. > > 2. Treat the existence of a pending unlink request as a relation-wide > fsync cancel; so the loop in mdsync needs one extra hashtable lookup > to determine validity of a PendingOperationEntry. As before, this > should not matter much considering that we're about to do an fsync(). > > 3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC > message if it is sending an UNLINK_RELATION_REQUEST. (A side benefit > is that this gives us another 2X reduction in fsync queue traffic, > and not just any queue traffic but the type of traffic that we must > not fail to queue.) > > The FORGET_RELATION_FSYNC code path will still exist, and will still > require a full hashtable scan, but we don't care because it isn't > being used in common situations. It would only be needed for stuff > like killing an init fork. > > The argument that this is safe involves these points: > > * mdunlink cannot send UNLINK_RELATION_REQUEST until it's done > ftruncate on the main fork's first segment, because otherwise that > segment could theoretically get unlinked from under it before it can do > the truncate. But this is okay since the ftruncate won't cause any > fsync the checkpointer might concurrently be doing to fail. The > request *will* be sent before we unlink any other files, so mdsync > will be able to recover if it gets an fsync failure due to concurrent > unlink. > > * Because a relfilenode cannot be recycled until we process and delete > the PendingUnlinkEntry during mdpostckpt, it is not possible for valid > new fsync requests to arrive while the PendingUnlinkEntry still exists > to cause them to be considered canceled. > > * Because we only process and delete PendingUnlinkEntrys that have been > there since before the checkpoint started, we can be sure that any > PendingOperationEntrys referring to the relfilenode will have been > scanned and deleted by mdsync before we remove the PendingUnlinkEntry. > > Unless somebody sees a hole in this logic, I'll go make this happen. What if we change the hash table to have RelFileNode as the key and an array of MAX_FORKNUM bitmapsets as the value? Then when you get a "forget" request, you can just zap all the sets to empty. That seems a whole lot simpler than your proposal and I don't see any real downside. I can't actually poke a whole in your logic at the moment but a simpler system that requires no assumptions about filesystem behavior seems preferable to me. You can still make an unlink request imply a corresponding forget-request if you want, but now that's a separate optimization. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers