回复：Bug in MultiXact replay compat logic for older minor version after crash-recovery

段坤仁(刻韧) Fri, 20 Mar 2026 00:47:37 -0700
On 19/03/2026 20:11, Heikki Linnakangas wrote:> This fix is not correct. The 
CREATE_ID records can appear in the WAL
> out of order, e.g:
>
> CREATE_ID:N+2 -> CREATE_ID:N+3 -> CREATE_ID:N+1
>
> With the patch, replaying the CREATE_ID:N+1 record would overwrite
> the changes made by the CREATE_ID:N+2 and CREATE_ID:N+3 records.
> Or perhaps track last-zeroed page separately from
> latest_page_number, and if we haven't seen any
> XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call
> SimpleLruDoesPhysicalPageExist() to determine if initialization
> is needed. Attached patch does that.
    
Thanks for the quick reply and the fix! My two patches were both
stuck in the trap of trying to fix latest_page_number itself,
which as you pointed out is incorrect. Your approach of stepping
back and tracking last_initialized_offsets_page separately, with
SimpleLruDoesPhysicalPageExist() as the fallback, is much cleaner.
I had considered using SimpleLruDoesPhysicalPageExist() myself but
dismissed it as too expensive -- looking at it now, it's probably
the necessary path for handling unknown state after restart.
    
I've applied your patch on both REL_17_STABLE and REL_18_STABLE
and verified it fixes the crash-restart scenario. The same code
pattern exists on REL_14_STABLE through REL_16_STABLE, so it
should apply there as well. It would be good to have another
pair of eyes review the fix across all affected branches.
    
One minor observation on the patch: the comment for
last_initialized_offsets_page currently says:
> last_initialized_offsets_page is the XLOG_MULTIXACT_ZERO_OFF_PAGE
> record that we saw during WAL replay
But the variable is also updated when the compat logic in
RecordNewMultiXact() initializes a page itself. So it tracks the
last page zeroed during replay from any source, not just from
ZERO_OFF_PAGE records. Perhaps something like:
  /*
   * last_initialized_offsets_page tracks the last offsets page that
   * was zeroed during WAL replay, whether by replaying an
   * XLOG_MULTIXACT_ZERO_OFF_PAGE record or by the compat logic in
   * RecordNewMultiXact(). -1 if no page has been zeroed yet since
   * the last restart.
   */
    
== Maybe add a TAP test? ==
The compat logic in RecordNewMultiXact() has now been the source
of three bugs in a short period: it was originally introduced as
an intermediate step in [0], then broken by TRUNCATE replay [1],
and now by the crash-restart scenario described in this thread.
The code sits at a tricky intersection of WAL ordering, SLRU
page management, and crash recovery -- future changes to SLRU
initialization, checkpoint logic, or multixact lifecycle could
potentially re-introduce a regression. A targeted regression
test would help catch such issues early, especially since the
failure mode (standby crash loop) is severe and hard to diagnose
in production.
    
The compat logic now has two branches:
> if (last_initialized_offsets_page == -1)
>     init_needed = !SimpleLruDoesPhysicalPageExist(..., next_pageno);
> else
>     init_needed = (last_initialized_offsets_page == pageno);
    
The first branch (fallback path) fires after crash-restart,
when last_initialized_offsets_page is -1 because no
ZERO_OFF_PAGE has been replayed yet. I wrote a TAP test that
exercises this path end-to-end: crash-restart the standby,
truncate the offset file to remove the next page, then verify
the compat logic detects the missing page via
SimpleLruDoesPhysicalPageExist() and initializes it so the
standby can continue replay. This ensures the fallback path
works as a safety net when the tracking state is unknown.
    
The second branch (fast path) relies on
last_initialized_offsets_page advancing monotonically. To
guard against future code that might modify this variable
elsewhere and break the assumption, we could add Asserts at
the two write sites:
  /* at each write site: */
  Assert(last_initialized_offsets_page == -1 ||
         MultiXactOffsetPagePrecedes(
             last_initialized_offsets_page, new_page));
This ensures new_page is always ahead of
last_initialized_offsets_page, even across wraparound.
    
The test reproduces the fallback scenario by creating the
crash-restart condition where last_initialized_offsets_page
is -1 and the next offsets page is physically missing:
1. Using pg_resetwal -m 2047,1 to place nextMulti at the last
   entry on offset page 0, so the next allocation crosses a
   page boundary.
2. Using an injection point to pause after GetNewMultiXactId()
   but before XLogInsert(CREATE_ID), creating the window where
   CHECKPOINT captures the advanced nextMulti.
3. Triggering CHECKPOINT during the pause, waiting for the
   standby to replay it and do a restartpoint.
4. Crashing the standby.
5. Truncating the standby's offset file to remove the next page,
   simulating the old-version condition where ZERO_OFF_PAGE was
   never written. This is necessary because on a new-version
   primary, ZERO_OFF_PAGE is always written before CREATE_ID,
   so the page would normally exist on disk.
6. Waking up the injection point so CREATE_ID gets written.
7. Restarting the standby and verifying the compat logic
   initializes the missing page.
    
Without your fix applied, the test correctly detects the bug.
I've implemented this in attached patches, one for REL_17_STABLE and
one for REL_18_STABLE. I'm still looking for a way to add test
coverage on REL_14_STABLE through REL_16_STABLE, where the
injection point infrastructure is not available.
Would appreciate any comments on the test approach or the patches.
    
Regards
Duan
0001-Add-multixact-backwards-compatibility-test-REL_17.patch
Description: Binary data
0002-Add-multixact-backwards-compatibility-test-REL_18.patch
Description: Binary data
回复：Bug in MultiXact replay compat logic for older minor version after crash-recovery

Reply via email to