Hello all, While working on checksum support for GPDB, we noticed that several callers of PageGetLSN() didn't follow the correct locking procedure. To try to help ferret out present and future mistakes, we added an assertion to PageGetLSN() that checks whether those locks were being held correctly. This patch is a first-draft attempt to port that assertion back up to postgres master, based on work by Asim Praveen, Ashwin Agrawal, and myself.
The patch is really two pieces: add the assertion, and fix the callers that would trip it. The latter part is still in progress, because I'm running into some places where I'm not sure what the correct way forward is. (I'm a newbie to this list and this code base, so please don't hesitate to correct me on anything, whether that's code- or culture-related!) = Notes/Observations = - This change introduces a subtle source incompatibility: whereas previously both Pages and PageHeaders could be passed to PageGetLSN(), now callers must explicitly cast to Page if they don't already have one. - I originally tried to put (and the GPDB patch succeeds in putting) the entire assertion implementation into PageGetLSN in bufpage.h. That seems unworkable here -- buf_internals.h is no longer publicized, and the assertion needs those internals to perform its checks. So I moved the assertion into its own helper function in bufmgr.c. If assertions are disabled, that helper declaration gets turned into an empty macro, so there shouldn't be a performance hit. = Open Questions = - Is my use of FRONTEND here correct? The documentation made it seem like some compilers could try to link against the AssertPageIsLockedForLSN function, even if PageGetLSN was never called. - Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't really know what the best way is to fix it. It explicitly unlocks the buffer, with the comment that visibilitymap_set() needs to handle locking itself, but then calls PageGetLSN() to determine whether visibilitymap_set() needs to be called. - TestForOldSnapshot is also problematic. The function is called in places where only a shared lock is held, so it needs to hold the header spinlock at least some of the time, but we only pass the Page as an argument. I could do the same "find buffer descriptor from page pointer" logic that we utilize in the assertion implementation, but is there a better way? - Should I try to add this assertion to PageSetLSN() as well? We were focused mostly on the GetLSN side of things, since that was the more risky piece of our backport. PageSetLSN is called from many more places, and the review there would be much larger. = Known Issues = - This approach doesn't seem to work when checksums are disabled. In that case, BufferGetLSNAtomic isn't actually atomic, so it seems to always trip the assertion. I'm not sure of the best way to proceed -- is it really correct not to follow the locking contract if checksums are disabled? What do you think? Is this something worth pursuing? Any and all comments welcome. Thanks! --Jacob
PageGetLSN-assert-locks-held.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers