On Tue, Nov 7, 2017 at 7:27 AM, Asim Praveen <aprav...@pivotal.io> wrote: > On Mon, Oct 2, 2017 at 6:48 PM, Michael Paquier <michael.paqu...@gmail.com> > wrote: >> Jacob, here are some ideas to make this thread move on. I would >> suggest to produce a set of patches that do things incrementally: >> 1) One patch that changes the calls of PageGetLSN to >> BufferGetLSNAtomic which are now not appropriate. You have spotted >> some of them in the btree and gist code, but not all based on my first >> lookup. There is still one in gistFindCorrectParent and one in btree >> _bt_split. The monitoring of the other calls (sequence.c and >> vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble >> that should be fixed I think. > > Thank you for your suggestions. Please find the first patch attached as > "0001-...". We verified both, gistFindCorrectParent and _bt_split and all > calls to PageGetLSN are made with exclusive lock on the buffer contents > held.
Cool. Thanks for double-checking. XLogRecordAssemble() is fine after more lookup at this code, XLogRegisterBuffer already doing some sanity checks. >> 2) A second patch that strengthens a bit checks around >> BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you >> are originally suggesting. >> A comment could be as well added in bufpage.h for PageGetLSN to let >> users know that it should be used carefully, in the vein of what is >> mentioned in src/backend/access/transam/README. > > The second patch "0002-..." does the above. We have a comment added to > AssertPageIsLockedForLSN as suggested. Did you really test WAL replay? This still ignores that PageGetLSN is as well taken in some code paths, like recovery, where actions on the page are guaranteed to be serialized, like during recovery, so this patch would cause the system to blow up. Note that pageinspect, amcheck and wal_consistency_checking also process on page copies. So the assertion failure of 0002 would trigger in those cases. > The assertion added caught at least one code path where TestForOldSnapshot > calls PageGetLSN without holding any lock. The snapshot_too_old test in > "check-world" failed due to the assertion failure. This needs to be fixed, > see the open question in the opening mail on this thread. Good point. I am looping Kevin Grittner here for his input, as the author and committer of old_snapshot_threshold. Things can be addressed with a separate patch by roughly moving the check on PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic() instead. The commit fest has lost view of this entry, and so did I. So I have added a new entry: https://commitfest.postgresql.org/16/1371/ BufferGetLSNAtomic() could really use LWLockHeldByMe(). Could you consider it with an extra patch on top of 0001? It seems to me that 0001 is good for a committer lookup, that will get rid of all existing bugs. For 0002, what you are proposing is still not a good idea for anything using page copies. Here are some suggestions: - Implement a PageGetLSNFromCopy, dedicated at working correctly when working on a page copy. Then switch callers of amcheck, pageinspect and wal_consistency_checking to use that. - Implement something like GetLSNFromLockedPage, and switch of backend's PageGetLSN to that. Performance impact could be seen.. - Have a PageGetLSNSafe, which can be used safely for serialized actions. It could be an idea to remove PageGetLSN to force a breakage of extensions calling it, so as they would review any of its calls. Not a fan of that though. -- Michael -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers