On Thu, Jul 19, 2012 at 2:57 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> What about checking just the immediately previous entry?  This would
>>> at least fix the problem for bulk-load situations, and the cost ought
>>> to be about negligible compared to acquiring the LWLock.
>> 2. You say "fix the problem" but I'm not exactly clear what problem
>> you think this fixes.
> What I'm concerned about is that there is going to be a great deal more
> fsync request queue traffic in 9.2 than there ever was before, as a
> consequence of the bgwriter/checkpointer split.  The design expectation
> for this mechanism was that most fsync requests would be generated
> locally inside the bgwriter and thus go straight into the hash table
> without having to go through the shared-memory queue.  I admit that
> we have seen no benchmarks showing that there's a problem, but that's
> because up till yesterday the bgwriter was failing to transmit such
> messages at all.  So I'm looking for ways to cut the overhead.
> But having said that, maybe we should not panic until we actually see
> some benchmarks showing the problem.

+1 for not panicking.  I'm prepared to believe that there could be a
problem here, but I'm not prepared to believe that we've characterized
it well enough to be certain that any changes we choose to make will
make things better not worse.

> Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
> I have been looking at the two-hash-tables design I suggested before,
> and realized that there's a timing issue: if we just stuff "forget"
> requests into a separate table, there is no method for determining
> whether a given fsync request arrived before or after a given forget
> request.  This is problematic if the relfilenode gets recycled: we
> need to be able to guarantee that a previously-posted forget request
> won't cancel a valid fsync for the new relation.  I believe this is
> soluble though, if we merge the "forget" requests with unlink requests,
> because a relfilenode can't be recycled until we do the unlink.
> So as far as the code goes:
> 1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
> RelFileNode.  It acts the same as before, and shouldn't be materially
> slower to process, but now we can determine in O(1) time whether there
> is a pending unlink for a relfilenode.
> 2. Treat the existence of a pending unlink request as a relation-wide
> fsync cancel; so the loop in mdsync needs one extra hashtable lookup
> to determine validity of a PendingOperationEntry.  As before, this
> should not matter much considering that we're about to do an fsync().
> 3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
> message if it is sending an UNLINK_RELATION_REQUEST.  (A side benefit
> is that this gives us another 2X reduction in fsync queue traffic,
> and not just any queue traffic but the type of traffic that we must
> not fail to queue.)
> The FORGET_RELATION_FSYNC code path will still exist, and will still
> require a full hashtable scan, but we don't care because it isn't
> being used in common situations.  It would only be needed for stuff
> like killing an init fork.
> The argument that this is safe involves these points:
> * mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
> ftruncate on the main fork's first segment, because otherwise that
> segment could theoretically get unlinked from under it before it can do
> the truncate.  But this is okay since the ftruncate won't cause any
> fsync the checkpointer might concurrently be doing to fail.  The
> request *will* be sent before we unlink any other files, so mdsync
> will be able to recover if it gets an fsync failure due to concurrent
> unlink.
> * Because a relfilenode cannot be recycled until we process and delete
> the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
> new fsync requests to arrive while the PendingUnlinkEntry still exists
> to cause them to be considered canceled.
> * Because we only process and delete PendingUnlinkEntrys that have been
> there since before the checkpoint started, we can be sure that any
> PendingOperationEntrys referring to the relfilenode will have been
> scanned and deleted by mdsync before we remove the PendingUnlinkEntry.
> Unless somebody sees a hole in this logic, I'll go make this happen.

What if we change the hash table to have RelFileNode as the key and an
array of MAX_FORKNUM bitmapsets as the value?  Then when you get a
"forget" request, you can just zap all the sets to empty.  That seems
a whole lot simpler than your proposal and I don't see any real
downside.  I can't actually poke a whole in your logic at the moment
but a simpler system that requires no assumptions about filesystem
behavior seems preferable to me.

You can still make an unlink request imply a corresponding
forget-request if you want, but now that's a separate optimization.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to