On Fri, Sep 11, 2015 at 11:51 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Fri, Sep 11, 2015 at 10:45 AM, Alvaro Herrera <alvhe...@2ndquadrant.com>
> wrote:
>
>> Bernd Helmle wrote:
>> > A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
>> > instance.
>> >
>> > The database crashed with the following log messages:
>> >
>> > 2015-09-08 00:49:16 CEST [2912] PANIC: could not access status of
>> > transaction 1068235595
>> > 2015-09-08 00:49:16 CEST [2912] DETAIL: Could not open file
>> > "pg_multixact/members/FFFF5FC4": No such file or directory.
>> > 2015-09-08 00:49:16 CEST [2912] STATEMENT: delete from StockTransfer
>> > where oid = $1 and tanum = $2
>>
>> I wonder if these bogus page and offset numbers are just
>> SlruReportIOError being confused because pg_multixact/members is so
>> weird (I don't think it should be the case, since this stuff is using
>> page numbers only, not anything related to how each page is laid out).
>>
>
> But SlruReportIOError uses the same macro to build the filename as
> SlruReadPhysicalPage and other functions, namely SlruFileName which uses
> sprintf with %04X (unsigned integer uppercase hex) and gives it segno
> (which is always an int), so I don't think the problem is in error
> reporting only.
>
> Assuming default block size, to get FFFF5FC4 from SlruFileName you need
> segno == -41020.
>

Oops, I meant to attach the proviso "Assuming default block size" to the
assumption further down that MULTIXACT_MEMBERS_PER_PAGE == 1636.

> We have int segno = pageno / 32 (that's SLRU_PAGES_PER_SEGMENT), so to get
> segno == -41020 you need pageno between -1312640 and -1312609 (whose bit
> patterns reinterpreted as unsigned are 4293654656 and 4293654687).
>
> In various places we have int pageno = offset / (uint32) 1636, expanded
> from this macro (which calls the offset an xid):
>
> #define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
>
> I don't really see how any uint32 value could produce such a pageno via
> that macro.
> Even if called in an environment where (xid) is accidentally
> an int, the int / unsigned expression would convert it to unsigned first
> (unless (xid) is a bigger type like int64_t: by the rules of int promotion
> you'd get signed division in that case, hmm...).  But it's always called
> with a MultiXactOffset AKA uint32 variable.
>
> So via that route, there is no MultiXactOffset value that can't be mapped
> to a segment in the range "0000", "14078".  Famously, it wraps after that.
>
> Maybe the negative pageno came from somewhere else.  Where?  Inside SLRU
> code we can see pageno = shared->page_number[slotno]... maybe the SLRU
> slots got corrupted somehow?
>

-- 
Thomas Munro
http://www.enterprisedb.com
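
For anyone who wants to check the arithmetic quoted above, here is a
minimal C sketch.  It is not PostgreSQL code: segment_name() and
offset_to_segno() are hypothetical helpers written only for illustration,
and the constants 1636 and 32 are the values mentioned in the thread,
which assume the default block size.  It formats a segment number with
%04X, as the thread says SlruFileName does, and maps a uint32 member
offset to a segment number using unsigned division for the page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values quoted in the thread; both assume the default block size. */
#define MEMBERS_PER_PAGE   1636u   /* stands in for MULTIXACT_MEMBERS_PER_PAGE */
#define PAGES_PER_SEGMENT  32      /* stands in for SLRU_PAGES_PER_SEGMENT */

/*
 * Hypothetical helper: format a segment number with "%04X".
 * The explicit cast makes the int -> unsigned reinterpretation visible;
 * a negative segno therefore prints as eight hex digits (32-bit int assumed).
 */
static void
segment_name(int segno, char *buf, size_t buflen)
{
    assert(buflen >= 9);    /* up to 8 hex digits plus the terminating NUL */
    snprintf(buf, buflen, "%04X", (unsigned int) segno);
}

/*
 * Hypothetical helper: member offset -> page number -> segment number.
 * The page is computed with unsigned division, so no uint32 offset can
 * yield a negative page number by this route.
 */
static int
offset_to_segno(uint32_t offset)
{
    int pageno = (int) (offset / (uint32_t) MEMBERS_PER_PAGE);

    return pageno / PAGES_PER_SEGMENT;
}
```

Calling segment_name(-41020, buf, sizeof buf) fills buf with "FFFF5FC4",
the file name from the PANIC message, while offset_to_segno(UINT32_MAX)
returns 82040, which segment_name() renders as "14078", the top of the
range mentioned above.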