On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <n...@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure.  This is 9.0.6 with the pg_cancel_backend by-same-role patch
backported, plus the standard Debian changes, so the compiler should
not be doing anything here that it doesn't normally do on this
platform.  I am also starting to grovel through the assembly, although
I don't have much experience finding problems this way.

To save you a tiny bit of time aligning the assembly with the C, this line

   c797f:       e8 7c c9 17 00          callq  244300 <LWLockAcquire>

seems to be the beginning of:

        LWLockAcquire(XidGenLock, LW_SHARED);
        checkPoint.nextXid = ShmemVariableCache->nextXid;
        checkPoint.oldestXid = ShmemVariableCache->oldestXid;
        checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
        LWLockRelease(XidGenLock);


>> If there's anything to note about the workload, I'd say that it does
>> tend to make fairly pervasive use of long running transactions which
>> can span probably more than one checkpoint, and the txid reporting
>> functions, and a concurrency level of about 300 or so backends ... but
>> per my reading of the mechanism so far, it doesn't seem like any of
>> this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me.  These
> symptoms imply a proper update of ControlFile->checkPointCopy.nextXid without
> having properly updated ControlFile->checkPointCopy.nextXidEpoch.  After
> recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy at all.
> Its logic for doing so looks simple and correct.

Yeah.  I'm pretty flabbergasted that so much seems to be going right
while this goes wrong.
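
For anyone following along, here is a minimal standalone sketch of the
invariant being discussed, as I understand it.  It is my paraphrase of
the epoch rule, not a copy of the server source: the epoch is carried
over from the previous checkpoint and bumped when the new nextXid is
numerically below the previous one.  XidState, next_checkpoint and
wide_xid are hypothetical names made up for illustration.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for the two checkpoint fields under discussion. */
typedef struct
{
    uint32_t      nextXidEpoch;
    TransactionId nextXid;
} XidState;

/*
 * Assumed rule: start from the previous checkpoint's epoch and increment
 * it if the 32-bit counter is now below its value at that checkpoint,
 * i.e. it wrapped around in between.
 */
static XidState
next_checkpoint(XidState prev, TransactionId cur_next_xid)
{
    XidState cp;

    cp.nextXidEpoch = prev.nextXidEpoch;
    if (cur_next_xid < prev.nextXid)
        cp.nextXidEpoch++;
    cp.nextXid = cur_next_xid;
    return cp;
}

/* Combine epoch and xid into one 64-bit value: epoch in the high 32 bits. */
static uint64_t
wide_xid(XidState s)
{
    return ((uint64_t) s.nextXidEpoch << 32) | s.nextXid;
}
```

With a previous state of epoch 0 and nextXid near 2^32, passing a small
nextXid to next_checkpoint() gives epoch 1, so wide_xid() keeps
increasing.  The symptom here looks like the other case: nextXid stored
after the wrap with the old epoch still in place, which makes the
64-bit value jump backwards.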

-- 
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
