I'm testing a checkout from a few days ago and trying to complete a day long pgbench stress test, with assertions and debugging on. I want to make sure the base code works as expected before moving on to testing checksums. It's crashing before finishing though. Here's a sample:

2012-12-20 20:36:11 EST [26613]: LOG: checkpoint complete: wrote 32 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.108 s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 s, average=0.139 s TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1703) 2012-12-20 20:36:44 EST [26611]: LOG: server process (PID 4834) was terminated by signal 6: Aborted
2012-12-20 20:36:44 EST [26611]: DETAIL:  Failed process was running: END;

Running the same test set again gave the same crash, so this looks repeatable. It's not trivial to trigger given the average runtime to crash I'm seeing. Second sample:

2012-12-22 21:27:17 EST [6006]: LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.310 s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 s, average=0.415 s TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1703) 2012-12-22 21:27:23 EST [26611]: LOG: server process (PID 12143) was terminated by signal 6: Aborted
2012-12-22 21:27:23 EST [26611]: DETAIL:  Failed process was running: END;

The important part:

TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1703)

That's the check done by "AtEOXact_Buffers - clean up at end of transaction". It fired while executing the "END;"" statement at the end of the standard pgbench test. The test looks for loose things that weren't pinned/unpinned correctly by the transaction. I'm not sure if getting a stack trace at that point will be useful. The ref count mistake could easily have been made by an earlier statement in the transaction block, right?

This is on a server where shared_buffers=2GB, checkpoint_segments=64. Test runs were automated with pgbench-tools, using the following configuration:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="500 1000"
SETCLIENTS="4 8 16 32 64 96"
SETTIMES=3
RUNTIME=600

All of the crashes so far have been at scale=500, and I haven't seen those finish yet. It failed 7 hours in the first time, then 4 hours in.

For code reference, the last commit in the source code tree I built against was:

af275a12dfeecd621ed9899a9382f26a68310263
Thu Dec 20 14:23:31 2012 +0200

Looking at recent activity, the only buffer pin related changes I saw was this one:

commit 62656617dbe49cca140f3da588a5e311b3fc35ea
Date:   Mon Dec 3 18:53:31 2012 +0000
    Avoid holding vmbuffer pin after VACUUM.

This is my first test like this against 9.3 development though, so the cause could be an earlier commit. I'm just starting with the most recent work as the first suspect. Next I think I'll try autovacuum=off and see if the crash goes away. Other ideas are welcome.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to