On 12/27/12 7:43 AM, Greg Stark wrote:
> If it's always the first buffer then it could conceivably still be
> some other heap allocated object that always lands before
> LocalRefCount. It does seem a bit weird to be storing 1<<30 though --
> there are no 1<<30 constants that we might be storing for example.

It is a strange power of two to be appearing there. I can follow your reasoning for why this could be a bit-flipping error. There's no sign of that elsewhere, though: no other crashes under load. I'm using this server because it has worked fine for a while now.
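
To make the bit-flip reading concrete: the reported value 1073741824 is exactly 1<<30, so it differs from the expected 0 in a single bit. Here is a minimal standalone C sketch of that check. The helper name single_bit_difference is made up for illustration and is not anything in the PostgreSQL tree.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of PostgreSQL.
 * Returns the index of the one bit in which "observed" differs from
 * "expected", or -1 if the two values are equal or differ in more
 * than one bit.
 */
static int
single_bit_difference(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;
    int      bit = 0;

    /* diff & (diff - 1) clears the lowest set bit; nonzero means 2+ bits set */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;

    while ((diff & 1u) == 0)
    {
        diff >>= 1;
        bit++;
    }
    return bit;
}
```

Calling single_bit_difference(0, 1073741824) yields 30, which is consistent with a single flipped bit but of course does not prove that is what happened.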

I added printing the buffer number, and they're all over the place:

2012-12-27 06:36:39 EST [26306]: WARNING: refcount of buf 29270 containing base/16384/90124 blockNum=82884, flags=0x127 is 1073741824 should be 0, globally: 0
2012-12-27 02:08:19 EST [21719]: WARNING: refcount of buf 114262 containing base/16384/81932 blockNum=133333, flags=0x106 is 1073741824 should be 0, globally: 0
2012-12-26 20:03:05 EST [15117]: WARNING: refcount of buf 142934 containing base/16384/73740 blockNum=87961, flags=0x127 is 1073741824 should be 0, globally: 0

The relation continues to bounce between pgbench_accounts and its primary key; there's no pattern there that I can see either. To answer a few other questions: this system does not have ECC RAM. It did survive many passes of memtest86+ without any problems, though, right after the above.

I tried duplicating the problem on a similar server. It keeps hanging due to some Linux software RAID bug before it runs for very long. Whatever is going on here, it really doesn't want to be discovered.

For reference's sake, the debugging code those latest messages came from is now:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..60d3ad3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,27 @@ AtEOXact_Buffers(bool isCommit)
        if (assert_enabled)
        {
                int                     i;
+               int                     RefCountErrors = 0;

                for (i = 0; i < NBuffers; i++)
                {
-                       Assert(PrivateRefCount[i] == 0);
+
+                       if (PrivateRefCount[i] != 0)
+                       {
+                               /*
+                               PrintBufferLeakWarning(&BufferDescriptors[i]);
+                               */
+                               BufferDesc *bufHdr = &BufferDescriptors[i];
+                               elog(WARNING,
+                                       "refcount of buf %d containing %s blockNum=%u, flags=0x%x is %u should be 0, globally: %u",
+                                       i,relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+                                       bufHdr->tag.blockNum, bufHdr->flags, PrivateRefCount[i], bufHdr->refcount);
+                               RefCountErrors++;
+                       }
                }
+               if (RefCountErrors > 0)
+                       elog(WARNING, "buffers with non-zero refcount is %d", RefCountErrors);
+               Assert(RefCountErrors == 0);
        }
 #endif
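
The idea of the change is to report every offending buffer before asserting, instead of stopping at the first one. Below is a minimal standalone sketch of that pattern outside PostgreSQL. The function name count_refcount_errors and the plain int32_t array are stand-ins I made up for illustration (they are not PrivateRefCount or any bufmgr.c API), and fprintf to stderr takes the place of elog.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the loop in the patch: scan an array of
 * per-buffer reference counts, warn about each non-zero entry, and
 * return how many were found so the caller can assert on the total.
 */
static int
count_refcount_errors(const int32_t *refcounts, int nbuffers)
{
    int errors = 0;
    int i;

    for (i = 0; i < nbuffers; i++)
    {
        if (refcounts[i] != 0)
        {
            fprintf(stderr,
                    "WARNING: refcount of buf %d is %d should be 0\n",
                    i, (int) refcounts[i]);
            errors++;
        }
    }

    if (errors > 0)
        fprintf(stderr,
                "WARNING: buffers with non-zero refcount is %d\n", errors);

    return errors;
}
```

A caller would then check the returned count once, after the loop, which is what keeps the full list of warnings in the log ahead of the assertion failure.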


