TruncateMultiXact() bugs

Heikki Linnakangas Fri, 14 Jun 2024 04:37:56 -0700

I was performing tests around multixid wraparound, when I ran into this assertion:

TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
/lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
2024-06-14 13:11:02.025 EEST [920971] LOG:  server process (PID 920981) was terminated by signal 6: Aborted
2024-06-14 13:11:02.025 EEST [920971] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13407 (to prevent wraparound)

The attached Python script reproduces this pretty reliably. It's a reduced version of a larger test script I was working on; it probably could be simplified further for this particular issue.


Looking at the code, it's pretty clear how it happens:

1. TruncateMultiXact does START_CRIT_SECTION();

2. In the critical section, it calls PerformMembersTruncation() -> SlruDeleteSegment() -> SlruInternalDeleteSegment() -> RegisterSyncRequest() -> ForwardSyncRequest()

3. If the fsync request queue is full, it calls CompactCheckpointerRequestQueue(), which calls palloc0(). Pallocs are not allowed in a critical section.

A straightforward fix is to add a check to CompactCheckpointerRequestQueue() to bail out without compacting, if it's called in a critical section. That would cover any other cases like this, where RegisterSyncRequest() is called in a critical section. I haven't tried searching if any more cases like this exist.
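
The failure mode and the proposed bail-out can be illustrated with a small toy model in Python. This is only a sketch, not PostgreSQL code: the class ToyBackend and its methods are hypothetical names that loosely mirror CritSectionCount, palloc0(), CompactCheckpointerRequestQueue() and ForwardSyncRequest(), and the compaction is simplified to plain duplicate removal.

```python
class ToyBackend:
    """Toy model of a backend forwarding sync requests (hypothetical, not PostgreSQL code)."""

    def __init__(self, max_requests, bail_out_in_crit_section):
        self.crit_section_count = 0
        self.max_requests = max_requests
        self.requests = []
        self.bail_out_in_crit_section = bail_out_in_crit_section

    def palloc0(self, n):
        # Stand-in for the Assert in mcxt.c: no allocations in a critical section.
        assert self.crit_section_count == 0, "allocation in critical section"
        return [False] * n

    def compact_queue(self):
        # The proposed fix: give up without allocating if in a critical section.
        if self.bail_out_in_crit_section and self.crit_section_count > 0:
            return False
        skip_slot = self.palloc0(len(self.requests))
        seen = set()
        for i, req in enumerate(self.requests):
            if req in seen:
                skip_slot[i] = True  # duplicate of an earlier request
            seen.add(req)
        before = len(self.requests)
        self.requests = [r for r, skip in zip(self.requests, skip_slot) if not skip]
        return len(self.requests) < before

    def forward_sync_request(self, req):
        # If the queue is full, try to compact it; report failure if that doesn't help.
        if len(self.requests) >= self.max_requests and not self.compact_queue():
            return False
        self.requests.append(req)
        return True


def try_in_critical_section(backend):
    """Fill the queue with distinct requests, then forward one more in a critical section."""
    for i in range(backend.max_requests):
        backend.forward_sync_request(("segment", i))
    backend.crit_section_count += 1
    try:
        return backend.forward_sync_request(("segment", "extra"))
    except AssertionError:
        return "assertion failure"
    finally:
        backend.crit_section_count -= 1
```

With bail_out_in_crit_section=False, try_in_critical_section() reports the assertion failure. With True, the request is simply rejected and the caller has to retry.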


But wait, there is more!

After applying that fix in CompactCheckpointerRequestQueue(), the test script often gets stuck. There's a deadlock between the checkpointer and the autovacuum backend trimming the SLRUs:


1. TruncateMultiXact does this:

        MyProc->delayChkptFlags |= DELAY_CHKPT_START;

2. It then makes that call to PerformMembersTruncation() and RegisterSyncRequest(). If it cannot queue the request, it sleeps a little and retries. But the checkpointer is stuck waiting for the autovacuum backend, because of delayChkptFlags, and will never clear the queue.

To fix, I propose to add AbsorbSyncRequests() calls to the wait-loops in CreateCheckPoint().
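
The deadlock and the effect of absorbing requests inside the wait loop can be sketched with a deterministic, single-threaded simulation. Again this is a toy model under stated assumptions: simulate_checkpoint_wait() and its parameters are hypothetical, each loop iteration stands for one 10 msec sleep of the checkpointer, and clearing the list stands in for AbsorbSyncRequests().

```python
def simulate_checkpoint_wait(absorb_while_waiting, queue_capacity=2, pending=3, max_ticks=100):
    """Return the tick at which the checkpointer stops waiting, or None if it never does."""
    queue = ["old request"] * queue_capacity  # the request queue starts out full
    delay_chkpt_start = True                  # set by the truncating backend
    for tick in range(max_ticks):
        # Checkpointer: wait until no backend has the delay flag set.
        if not delay_chkpt_start:
            return tick
        if absorb_while_waiting:
            queue.clear()                     # stand-in for AbsorbSyncRequests()
        # Backend: queue as many pending requests as fit, retry on the next tick otherwise.
        while pending > 0 and len(queue) < queue_capacity:
            queue.append("forget segment")
            pending -= 1
        if pending == 0:
            delay_chkpt_start = False         # truncation done, flag cleared
    return None
```

Without absorbing, the queue stays full, the backend never finishes, and the function returns None. With absorbing, the backend drains its pending requests within a few ticks and the checkpointer can proceed.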



Attached patch fixes both of those issues.

I can't help thinking that TruncateMultiXact() should perhaps not have such a long critical section. TruncateCLOG() doesn't do that. But it was added for good reasons in commit 4f627f897367, and this fix seems appropriate for the stable branches anyway, even if we come up with something better for master.


--
Heikki Linnakangas
Neon (https://neon.tech)

import time
import logging
import psycopg2
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple, Type, Union, cast
import subprocess

PGDATA = "pgdata-multixid-repro"

log = logging.getLogger(__name__)

def connect():
    return psycopg2.connect(
        database="postgres",
        user="postgres",
        host="/tmp/",
    )

# Constants and macros copied from PostgreSQL multixact.c and headers. These are needed to
# calculate the SLRU segments that a particular multixid or multixid-offsets falls into.
BLCKSZ = 8192
MULTIXACT_OFFSETS_PER_PAGE = int(BLCKSZ / 4)
SLRU_PAGES_PER_SEGMENT = int(32)
MXACT_MEMBER_BITS_PER_XACT = 8
MXACT_MEMBER_FLAGS_PER_BYTE = 1
MULTIXACT_FLAGBYTES_PER_GROUP = 4
MULTIXACT_MEMBERS_PER_MEMBERGROUP = MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE
MULTIXACT_MEMBERGROUP_SIZE = 4 * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP
MULTIXACT_MEMBERGROUPS_PER_PAGE = int(BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
MULTIXACT_MEMBERS_PER_PAGE = MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP

def MultiXactIdToOffsetSegment(xid: int):
    return int(xid / (SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE))


def MXOffsetToMemberSegment(off: int):
    return int(off / (SLRU_PAGES_PER_SEGMENT * MULTIXACT_MEMBERS_PER_PAGE))

def advance_multixid_to(next_multi_xid: int, next_multi_offset: int):
    """
    Use pg_resetwal to advance the nextMulti and nextMultiOffset values in a stand-alone
    Postgres cluster. This is useful to get close to wraparound or some other interesting
    value, without having to burn a lot of time consuming the (multi-)XIDs one by one.

    The new values should be higher than the old ones, in a wraparound-aware sense.

    On entry, the server should be running. It will be shut down and restarted.
    """

    # Read old values from the last checkpoint. We will pass the old oldestMultiXid value
    # back to pg_resetwal, there's no option to leave it alone.
    with connect() as conn:
        with conn.cursor() as cur:
            # Make sure the oldest-multi-xid value in the control file is up-to-date
            cur.execute("checkpoint")
            cur.execute("select oldest_multi_xid, next_multixact_id from pg_control_checkpoint()")
            (ckpt_oldest_multi_xid, ckpt_next_multi_xid) = cur.fetchone()
    log.info(f"oldestMultiXid was {ckpt_oldest_multi_xid}, nextMultiXid was {ckpt_next_multi_xid}")
    log.info(f"Resetting to {next_multi_xid}")

    # Use pg_resetwal to reset the next multiXid and multiOffset to given values.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "stop"])
    cmd = [
        "pg_resetwal",
        f"--multixact-ids={next_multi_xid},{ckpt_oldest_multi_xid}",
        f"--multixact-offset={next_multi_offset}",
        "-D",
        PGDATA,
    ]
    subprocess.check_call(cmd)

    # Because we skip over a lot of values, Postgres hasn't created the SLRU segments for
    # the new values yet. Create them manually, to allow Postgres to start up.
    #
    # This leaves "gaps" in the SLRU where segments between old value and new value are
    # missing. That's OK for our purposes. Autovacuum will print some warnings about the
    # missing segments, but will clean it up by truncating the SLRUs up to the new value,
    # closing the gap.
    segname = "%04X" % MultiXactIdToOffsetSegment(next_multi_xid)
    log.info(f"Creating dummy segment pg_multixact/offsets/{segname}")
    with open(f"{PGDATA}/pg_multixact/offsets/{segname}", "w") as of:
        of.write("\0" * SLRU_PAGES_PER_SEGMENT * BLCKSZ)
        of.flush()

    segname = "%04X" % MXOffsetToMemberSegment(next_multi_offset)
    log.info(f"Creating dummy segment pg_multixact/members/{segname}")
    with open(f"{PGDATA}/pg_multixact/members/{segname}", "w") as of:
        of.write("\0" * SLRU_PAGES_PER_SEGMENT * BLCKSZ)
        of.flush()

    # Start Postgres again and wait until autovacuum has processed all the databases
    #
    # This allows truncating the SLRUs, fixing the gaps with missing segments.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "start"])
    with connect().cursor() as cur:
        for _ in range(1000):
            cur.execute("select min(datminmxid::text::int8) from pg_database")
            datminmxid = int(cur.fetchall()[0][0])
            log.info(f"datminmxid {datminmxid}")
            if next_multi_xid - datminmxid < 1_000_000:  # not wraparound-aware!
                break
            time.sleep(0.5)


def main():
    # In order to test multixid wraparound, we need to first advance the counter to
    # within spitting distance of the wraparound, that is 2^32 multi-XIDs. We could simply
    # run a workload that consumes a lot of multi-XIDs until we approach that, but that
    # takes a very long time. So we cheat.
    #
    # Our strategy is to create a Postgres cluster, and use pg_resetwal to
    # set the multi-xid counter to a higher value. However, we cannot directly set
    # it to just before 2^32 (~ 4 billion), because that would make the existing
    # 'relminmxid' values look like they're in the future. It's not clear how the
    # system would behave in that situation. So instead, we bump it up ~ 1 billion
    # multi-XIDs at a time, and let autovacuum process all the relations and update
    # 'relminmxid' between each run.
    subprocess.check_call(["initdb", "-D", PGDATA, "-U", "postgres"])
    with open(f"{PGDATA}/postgresql.conf", "a") as file1:
        file1.writelines([
            "log_autovacuum_min_duration = 0\n",
            # Perform anti-wraparound vacuuming aggressively
            "autovacuum_naptime='1 s'\n",
            "autovacuum_freeze_max_age = 1000000\n",
            "autovacuum_multixact_freeze_max_age = 1000000\n",
            "shared_buffers='1 MB'",
        ])

    subprocess.check_call(["pg_ctl", "-D", PGDATA, "start"])
    advance_multixid_to(0x40000000, 0x10000000)
    advance_multixid_to(0x80000000, 0x20000000)
    advance_multixid_to(0xC0000000, 0x30000000)
    advance_multixid_to(0xFFFFFF00, 0xFFFFFF00)

if __name__ == "__main__":
    main()
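
As a worked example of the segment arithmetic in the script above, the following standalone snippet recomputes the dummy segment file names for the first advance_multixid_to(0x40000000, 0x10000000) call. It restates the script's constants for the 8 kB block size the script assumes and uses integer division; the lowercase helper names are made up for this illustration.

```python
BLCKSZ = 8192
SLRU_PAGES_PER_SEGMENT = 32
MULTIXACT_OFFSETS_PER_PAGE = BLCKSZ // 4                 # 2048 four-byte offsets per page
MULTIXACT_MEMBERS_PER_MEMBERGROUP = 4                    # 4 flag bytes, one member each
MULTIXACT_MEMBERGROUP_SIZE = 4 * 4 + 4                   # 4 four-byte XIDs + 4 flag bytes = 20
MULTIXACT_MEMBERGROUPS_PER_PAGE = BLCKSZ // MULTIXACT_MEMBERGROUP_SIZE  # 409
MULTIXACT_MEMBERS_PER_PAGE = (
    MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP  # 1636
)


def offsets_segment_name(multi_xid):
    # 32 pages * 2048 offsets = 65536 multixids per offsets segment
    return "%04X" % (multi_xid // (SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE))


def members_segment_name(offset):
    # 32 pages * 1636 members = 52352 members per members segment
    return "%04X" % (offset // (SLRU_PAGES_PER_SEGMENT * MULTIXACT_MEMBERS_PER_PAGE))


print(offsets_segment_name(0x40000000))  # 4000
print(members_segment_name(0x10000000))  # 1407
```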

From abe096fb960106410baaa84dc4ba96ba5b37c6d0 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Fri, 14 Jun 2024 14:20:58 +0300
Subject: [PATCH 1/1] Fix bugs in MultiXact truncation

1. TruncateMultiXact() performs the SLRU truncations in a critical
section. Deleting the SLRU segments calls ForwardSyncRequest(), which
will try to compact the request queue if it's full
(CompactCheckpointerRequestQueue()). That in turn allocates memory,
which is not allowed in a critical section. Backtrace:

    TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
    postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
    postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
    postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
    postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
    postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
    postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
    postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
    postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
    postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
    postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
    postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
    postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
    postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
    postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
    /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
    postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]

To fix, bail out in CompactCheckpointerRequestQueue() without doing
anything, if it's called in a critical section. That covers the above
call path, as well as any other similar cases where
RegisterSyncRequest might be called in a critical section.

2. After fixing that, another problem became apparent: Autovacuum
process doing that truncation can deadlock with the checkpointer
process. TruncateMultiXact() sets "MyProc->delayChkptFlags |=
DELAY_CHKPT_START". If the sync request queue is full and cannot be
compacted, the process will repeatedly sleep and retry, until there is
room in the queue. However, if the checkpointer is trying to start a
checkpoint at the same time, and is waiting for the DELAY_CHKPT_START
processes to finish, the queue will never shrink.

More concretely, the autovacuum process is stuck here:

    #0  0x00007fc934926dc3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x000056220b24348b in WaitEventSetWaitBlock (set=0x56220c2e4b50, occurred_events=0x7ffe7856d040, nevents=1, cur_timeout=<optimized out>) at ../src/backend/storage/ipc/latch.c:1570
    #2  WaitEventSetWait (set=0x56220c2e4b50, timeout=timeout@entry=10, occurred_events=<optimized out>, occurred_events@entry=0x7ffe7856d040, nevents=nevents@entry=1,
        wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:1516
    #3  0x000056220b243224 in WaitLatch (latch=<optimized out>, latch@entry=0x0, wakeEvents=wakeEvents@entry=40, timeout=timeout@entry=10, wait_event_info=wait_event_info@entry=150994949)
        at ../src/backend/storage/ipc/latch.c:538
    #4  0x000056220b26cf46 in RegisterSyncRequest (ftag=ftag@entry=0x7ffe7856d0a0, type=type@entry=SYNC_FORGET_REQUEST, retryOnError=true) at ../src/backend/storage/sync/sync.c:614
    #5  0x000056220af9db0a in SlruInternalDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1495
    #6  0x000056220af9dab1 in SlruDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1566
    #7  0x000056220af98e1b in PerformMembersTruncation (oldestOffset=<optimized out>, newOldestOffset=<optimized out>) at ../src/backend/access/transam/multixact.c:3006
    #8  TruncateMultiXact (newOldestMulti=newOldestMulti@entry=3221225472, newOldestMultiDB=newOldestMultiDB@entry=4) at ../src/backend/access/transam/multixact.c:3201
    #9  0x000056220b098303 in vac_truncate_clog (frozenXID=749, minMulti=<optimized out>, lastSaneFrozenXid=749, lastSaneMinMulti=3221225472) at ../src/backend/commands/vacuum.c:1917
    #10 vac_update_datfrozenxid () at ../src/backend/commands/vacuum.c:1760
    #11 0x000056220b1c3f76 in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2550
    #12 0x000056220b1c2c3d in AutoVacWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/autovacuum.c:1569

and the checkpointer is stuck here:

    #0  0x00007fc9348ebf93 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x00007fc9348fe353 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    #2  0x000056220b40ecb4 in pg_usleep (microsec=microsec@entry=10000) at ../src/port/pgsleep.c:50
    #3  0x000056220afb43c3 in CreateCheckPoint (flags=flags@entry=108) at ../src/backend/access/transam/xlog.c:7098
    #4  0x000056220b1c6e86 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:464

To fix, add AbsorbSyncRequests() to the loops where the checkpointer
waits for DELAY_CHKPT_START or DELAY_CHKPT_COMPLETE operations to
finish.
---
 src/backend/access/transam/xlog.c     | 9 +++++++++
 src/backend/postmaster/checkpointer.c | 4 ++++
 2 files changed, 13 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..8dcdf5a764 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7094,6 +7094,13 @@ CreateCheckPoint(int flags)
 	{
 		do
 		{
+			/*
+			 * Keep absorbing fsync requests while we wait. There could even
+			 * be a deadlock if we don't, if the process that prevents the
+			 * checkpoint is trying to add a request to the queue.
+			 */
+			AbsorbSyncRequests();
+
 			pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START);
 			pg_usleep(10000L);	/* wait for 10 msec */
 			pgstat_report_wait_end();
@@ -7109,6 +7116,8 @@ CreateCheckPoint(int flags)
 	{
 		do
 		{
+			AbsorbSyncRequests();
+
 			pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_COMPLETE);
 			pg_usleep(10000L);	/* wait for 10 msec */
 			pgstat_report_wait_end();
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3c68a9904d..199f008bcd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1169,6 +1169,10 @@ CompactCheckpointerRequestQueue(void)
 	/* must hold CheckpointerCommLock in exclusive mode */
 	Assert(LWLockHeldByMe(CheckpointerCommLock));
 
+	/* Avoid memory allocations in a critical section. */
+	if (CritSectionCount > 0)
+		return false;
+
 	/* Initialize skip_slot array */
 	skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
 
-- 
2.39.2
