Robert Haas <robertmh...@gmail.com> writes: > On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> I believe that this is fundamentally unavoidable so long as we use >> SnapshotNow to read catalogs --- which is something we've talked about >> changing, but it will require a pretty major R&D effort to make it >> happen.
> Ouch. > I wonder if we could avoid this anomaly by taking a throwaway MVCC > snapshot at the beginning of each system catalog scan and using it > just for the duration of that scan. If nothing that has touched the > catalog commits while the scan is open, then this is logically > equivalent to SnapshotNow. If something does commit in mid-scan, then > we might not get the latest version of the row, but we should end up > with exactly one. If it's not the latest one, we'll do the rebuild > again upon seeing the next sinval message; in the meantime, the > version we're using mustn't be too intolerably bad or it was an error > not to use AccessExclusiveLock in the first place. Yeah, this seems like a possibly workable direction to explore. I like this better than what Simon is proposing, because it would fix the generic issue for all types of catalog SnapshotNow scans. > IIUC, the problem with this approach is not correctness but > performance. Taking snapshots is (currently) expensive. Yeah. After mulling it for awhile, what about this idea: we could redefine SnapshotNow as a snapshot type that includes a list of transactions-in-progress, somewhat like an MVCC snapshot, but we don't fill that list from the PGPROC array. Instead, while running a scan with SnapshotNow, anytime we determine that a particular XID is still-in-progress, we add that XID to the snapshot's list. Subsequently, the SnapshotNow code assumes that XID to be still-in-progress without consulting its actual state. We reset the XID list to empty when starting a new SnapshotNow scan. (We might be able to do so less often than that, like only when we do AcceptInvalidationMessages, but it's not clear to me that there's any real benefit in hanging onto the state longer.) This costs no performance; if anything it should be faster than now, because we'll be replacing expensive transaction state probes with relatively-cheap searches of an XID array that should almost always be quite short. With this approach, we would have no serialization anomalies from single transactions committing while a scan is in progress. There could be anomalies resulting from considering an earlier XID to be in-progress while a later XID is considered committed (because we didn't observe it until later). So far as I can see offhand, the impact of that would be that there might be multiple versions of a tuple that are considered good, but never that there would be no version considered good (so long as the other XIDs simply updated the tuple and didn't delete it). I think this would be all right, since the scan would just seize on the first good version it finds. As you argue above, if that's not good enough for our purposes then the updater(s) should have taken a stronger lock. I am not, however, particularly pleased with the idea of trying to make this work in 9.1. I still think that we should back off the attempt to reduce lock strength in 9.1, and take it up for 9.2. We need to be stabilizing 9.1 for release, not introducing new untested mechanisms in it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers