On Fri, Sep 11, 2015 at 10:45 AM, Alvaro Herrera <alvhe...@2ndquadrant.com>
wrote:

> Bernd Helmle wrote:
> > A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
> > instance.
> >
> > The database crashed with the following log messages:
> >
> > 2015-09-08 00:49:16 CEST [2912] PANIC:  could not access status of
> > transaction 1068235595
> > 2015-09-08 00:49:16 CEST [2912] DETAIL:  Could not open file
> > "pg_multixact/members/FFFF5FC4": No such file or directory.
> > 2015-09-08 00:49:16 CEST [2912] STATEMENT:  delete from StockTransfer
> > where oid = $1 and tanum = $2
>
> I wonder if these bogus page and offset numbers are just
> SlruReportIOError being confused because pg_multixact/members is so
> weird (I don't think it should be the case, since this stuff is using
> page numbers only, not anything related to how each page is laid out).
>

But SlruReportIOError uses the same macro to build the filename as
SlruReadPhysicalPage and other functions, namely SlruFileName which uses
sprintf with %04X (unsigned integer uppercase hex) and gives it segno
(which is always an int), so I don't think the problem is in error
reporting only.

Assuming default block size, to get FFFF5FC4 from SlruFileName you need
segno == -41020.

We have int segno = pageno / 32 (that's SLRU_PAGES_PER_SEGMENT), so to get
segno == -41020 you need pageno between -1312671 and -1312640, since C
integer division truncates toward zero (the bit patterns of those two
values reinterpreted as unsigned are 4293654625 and 4293654656).

In various places we have int pageno = offset / (uint32) 1636, expanded
from this macro (which calls the offset an xid):

#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)

I don't really see how any uint32 value could produce such a pageno via
that macro.  Even if called in an environment where (xid) is accidentally
an int, the int / unsigned expression would convert it to unsigned first
(unless (xid) is a bigger type like int64_t: by the usual arithmetic conversions
you'd get signed division in that case, hmm...).  But it's always called
with a MultiXactOffset AKA uint32 variable.

So via that route, there is no MultiXactOffset value that can't be mapped
to a segment in the range "0000" to "14078".  Famously, it wraps after that.

Maybe the negative pageno came from somewhere else.  Where?  Inside SLRU
code we can see pageno = shared->page_number[slotno]... maybe the SLRU
slots got corrupted somehow?

-- 
Thomas Munro
http://www.enterprisedb.com
