Hi Hackers,

We are running Postgres at AIX and encoountered two strqange problems: active zombies process and deadlock in XLOG writer. First problem I will explain in separate mail, now I am mostly concerning about deadlock. It is irregularly reproduced with standard pgbench launched with 100 connections.


It sometimes happens with 9.6 stable version of Postgres but only when it is compiled with xlc compiler. We failed to reproduce the problem with GCC. So it looks like as bug in compiler or xlc-specific atomics implementation...
But there are few moments which contradicts with this hypothesis:

1. The problem is reproduce with Postgres built without optimization. Usually compiler bugs affect only optimized code.
2. Disabling atomics doesn't help.
3. Without optimization and with LOCK_DEBUG defined time of reproducing the problem significantly increased. With optimized code it is almost always reproduced in few minutes.
With debug version it usually takes much more time.

But the most confusing thing is stack trace:

(dbx) where
semop(??, ??, ??) at 0x9000000001f5790
PGSemaphoreLock(sema = 0x0a00000044b95928), line 387 in "pg_sema.c"
unnamed block in LWLockWaitForVar(lock = 0x0a0000000000d980, valptr = 0x0a0000000000d9a8, oldval = 102067874256, newval = 0x0fffffffffff9c10), line 1666 in "lwlock.c" LWLockWaitForVar(lock = 0x0a0000000000d980, valptr = 0x0a0000000000d9a8, oldval = 102067874256, newval = 0x0fffffffffff9c10), line 1666 in "lwlock.c" unnamed block in WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
AdvanceXLInsertBuffer(upto = 102067874256, opportunistic = '\0'), line 1916 in "xlog.c"
unnamed block in GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
CopyXLogRecordToWAL(write_len = 70, isLogSwitch = '\0', rdata = 0x000000011007ce10, StartPos = 102067874256, EndPos = 102067874328), line 1279 in "xlog.c" XLogInsertRecord(rdata = 0x000000011007ce10, fpw_lsn = 102067718328), line 1011 in "xlog.c" unnamed block in XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
log_heap_update(reln = 0x0000000110273540, oldbuf = 40544, newbuf = 40544, oldtup = 0x0fffffffffffa2a0, newtup = 0x00000001102bb958, old_key_tuple = (nil), all_visible_cleared = '\0', new_all_visible_cleared = '\0'), line 7708 in "heapam.c" unnamed block in heap_update(relation = 0x0000000110273540, otid = 0x0fffffffffffa6f8, newtup = 0x00000001102bb958, cid = 1, crosscheck = (nil), wait = '^A', hufd = 0x0fffffffffffa5b0, lockmode = 0x0fffffffffffa5c8), line 4212 in "heapam.c" heap_update(relation = 0x0000000110273540, otid = 0x0fffffffffffa6f8, newtup = 0x00000001102bb958, cid = 1, crosscheck = (nil), wait = '^A', hufd = 0x0fffffffffffa5b0, lockmode = 0x0fffffffffffa5c8), line 4212 in "heapam.c" unnamed block in ExecUpdate(tupleid = 0x0fffffffffffa6f8, oldtuple = (nil), slot = 0x00000001102bb308, planSlot = 0x00000001102b4630, epqstate = 0x00000001102b2cd8, estate = 0x00000001102b29e0, canSetTag = '^A'), line 937 in "nodeModifyTable.c" ExecUpdate(tupleid = 0x0fffffffffffa6f8, oldtuple = (nil), slot = 0x00000001102bb308, planSlot = 0x00000001102b4630, epqstate = 0x00000001102b2cd8, estate = 0x00000001102b29e0, canSetTag = '^A'), line 937 in "nodeModifyTable.c"
ExecModifyTable(node = 0x00000001102b2c30), line 1516 in "nodeModifyTable.c"
ExecProcNode(node = 0x00000001102b2c30), line 396 in "execProcnode.c"
ExecutePlan(estate = 0x00000001102b29e0, planstate = 0x00000001102b2c30, use_parallel_mode = '\0', operation = CMD_UPDATE, sendTuples = '\0', numberTuples = 0, direction = ForwardScanDirection, dest = 0x00000001102b7520), line 1569 in "execMain.c" standard_ExecutorRun(queryDesc = 0x00000001102b25c0, direction = ForwardScanDirection, count = 0), line 338 in "execMain.c" ExecutorRun(queryDesc = 0x00000001102b25c0, direction = ForwardScanDirection, count = 0), line 286 in "execMain.c" ProcessQuery(plan = 0x00000001102b1510, sourceText = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;", params = (nil), dest = 0x00000001102b7520, completionTag = ""), line 187 in "pquery.c" unnamed block in PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c" unnamed block in PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c" PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c" unnamed block in PortalRun(portal = 0x0000000110115e20, count = 9223372036854775807, isTopLevel = '^A', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 815 in "pquery.c" PortalRun(portal = 0x0000000110115e20, count = 9223372036854775807, isTopLevel = '^A', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 815 in "pquery.c" unnamed block in exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c" exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c" unnamed block in PostgresMain(argc = 1, argv = 0x0000000110119f68, dbname = "postgres", username = "postgres"), line 4076 in "postgres.c" PostgresMain(argc = 1, argv = 0x0000000110119f68, dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
BackendRun(port = 0x0000000110114290), line 4279 in "postmaster.c"
BackendStartup(port = 0x0000000110114290), line 3953 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
ServerLoop(), line 1701 in "postmaster.c"
PostmasterMain(argc = 3, argv = 0x00000001100c6190), line 1309 in "postmaster.c"
main(argc = 3, argv = 0x00000001100c6190), line 228 in "main.c"


As I already mentioned, we built Postgres with LOCK_DEBUG , so we can inspect lock owner. Backend is waiting for itself!
Now please look at two frames in this stack trace marked with red.
XLogInsertRecord is setting WALInsert locks at the beginning of the function:

    if (isLogSwitch)
        WALInsertLockAcquireExclusive();
    else
        WALInsertLockAcquire();

WALInsertLockAcquire just selects random item from WALInsertLocks array and exclusively locks:

    if (lockToTry == -1)
        lockToTry = MyProc->pgprocno % NUM_XLOGINSERT_LOCKS;
    MyLockNo = lockToTry;
    immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);

Then, following the stack trace, AdvanceXLInsertBuffer calls WaitXLogInsertionsToFinish:

            /*
             * Now that we have an up-to-date LogwrtResult value, see if we
             * still need to write it or if someone else already did.
             */
            if (LogwrtResult.Write < OldPageRqstPtr)
            {
                /*
* Must acquire write lock. Release WALBufMappingLock first, * to make sure that all insertions that we need to wait for
                 * can finish (up to this same position). Otherwise we risk
                 * deadlock.
                 */
                LWLockRelease(WALBufMappingLock);

WaitXLogInsertionsToFinish(OldPageRqstPtr);

                LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);


It releases WALBufMappingLock but not WAL insert locks!
Finally in WaitXLogInsertionsToFinish tries to wait for all locks:

    for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
    {
        XLogRecPtr    insertingat = InvalidXLogRecPtr;

        do
        {
            /*
* See if this insertion is in progress. LWLockWait will wait for
             * the lock to be released, or for the 'value' to be set by a
* LWLockUpdateVar call. When a lock is initially acquired, its * value is 0 (InvalidXLogRecPtr), which means that we don't know
             * where it's inserting yet.  We will have to wait for it.  If
* it's a small insertion, the record will most likely fit on the * same page and the inserter will release the lock without ever
             * calling LWLockUpdateVar.  But if it has to sleep, it will
             * advertise the insertion point with LWLockUpdateVar before
             * sleeping.
             */
            if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
 &WALInsertLocks[i].l.insertingAt,
                                 insertingat, &insertingat))

And here we stuck!
The comment to WaitXLogInsertionsToFinish says:

 * Note: When you are about to write out WAL, you must call this function
 * *before* acquiring WALWriteLock, to avoid deadlocks. This function might
 * need to wait for an insertion to finish (or at least advance to next
* uninitialized page), and the inserter might need to evict an old WAL buffer
 * to make room for a new one, which in turn requires WALWriteLock.

Which contradicts to the observed stack trace.

I wonder if it is really synchronization bug in xlog.c or there is something wrong in this stack trace and it can not happen in case of normal work?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Reply via email to