Hi,

On 2022-07-01 13:14:23 -0700, Andres Freund wrote:
> I saw a couple failures of 031 on CI for the meson patch - which obviously
> doesn't change anything around this. However it adds a lot more distributions,
> and the added ones run in docker containers on a shared host, presumably
> adding a lot of noise.  I saw this more frequently when I accidentally had the
> test runs at the number of CPUs in the host, rather than the allotted CPUs in
> the container.
>
> That made me look more into these issues. I played with adding a pg_usleep()
> to RecoveryConflictInterrupt() to simulate slow signal delivery.
>
> Found a couple things:
>
> - the pg_usleep(5000) in ResolveRecoveryConflictWithVirtualXIDs() can
>   completely swamp the target(s) on a busy system. This surely is exacerbated
>   by the usleep I added to RecoveryConflictInterrupt(), but a 5ms signalling pace
>   does seem like a bad idea.

This one is also implicated in
https://postgr.es/m/20220701154105.jjfutmngoedgiad3%40alvherre.pgsql
and related issues.

Besides being very short, it also seems like a bad idea to wait when we might
not need to. Shouldn't we only sleep if we subsequently fail to get the lock?

Misleadingly, WaitExceedsMaxStandbyDelay() also contains a usleep, which I at
least wouldn't expect given its name.


A minimal fix would be to increase the wait time progressively, similar to how
it is done with standbyWait_us?
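
To make that concrete, here is a rough sketch of the relevant loop in
standby.c's ResolveRecoveryConflictWithVirtualXIDs(), paraphrased from memory
rather than quoted verbatim, so don't trust the details. The sleep that backs
off via standbyWait_us is the one hidden inside WaitExceedsMaxStandbyDelay();
the post-signalling pg_usleep(5000L) is the one that re-signals at a fixed
~5ms pace:

    while (VirtualTransactionIdIsValid(*waitlist))
    {
        /* backoff state consumed by WaitExceedsMaxStandbyDelay() */
        standbyWait_us = STANDBY_INITIAL_WAIT_US;

        while (!VirtualXactLock(*waitlist, false))
        {
            /* sleeps pg_usleep(standbyWait_us) internally and doubles it */
            if (WaitExceedsMaxStandbyDelay(wait_event_info))
            {
                pid = CancelVirtualTransaction(*waitlist, reason);

                /*
                 * The fixed 5ms sleep in question: if the target is slow to
                 * process the interrupt, we loop around and signal it again
                 * every ~5ms.  Increasing this progressively, like the
                 * standbyWait_us doubling above, would be the minimal fix.
                 */
                if (pid != 0)
                    pg_usleep(5000L);
            }
        }
        waitlist++;
    }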

Medium term, it seems we ought to set the startup process's latch when
handling a conflict, and use a latch wait here instead. But avoiding races
probably isn't quite trivial.
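
What I have in mind is roughly the following on the startup side, replacing
the fixed sleep with a latch wait (just a sketch; the timeout is arbitrary,
and the conflicting backend would additionally have to SetLatch() the startup
process's latch once it has dealt with the conflict):

    /* instead of pg_usleep(5000L) after signalling the conflicting backend */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     100 /* ms, fallback in case a wakeup is lost */,
                     wait_event_info);
    ResetLatch(MyLatch);
    /* then loop around and re-check VirtualXactLock() */

The ResetLatch()-before-recheck ordering is the part that needs care to avoid
missed wakeups.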


> - we process the same recovery conflict (not a single signal, but a single
>   "real conflict") multiple times in the target of a conflict, presumably
>   while handling the error. That includes handling the same interrupt once as
>   an ERROR and once as FATAL.
>
>   E.g.
>
> 2022-07-01 12:19:14.428 PDT [2000572] LOG:  recovery still waiting after 10.032 ms: recovery conflict on buffer pin
> 2022-07-01 12:19:14.428 PDT [2000572] CONTEXT:  WAL redo at 0/344E098 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 100; blkref #0: rel 1663/16385/1>
> 2022-07-01 12:19:54.597 PDT [2000578] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery at character 15
> 2022-07-01 12:19:54.597 PDT [2000578] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
> 2022-07-01 12:19:54.597 PDT [2000578] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
> 2022-07-01 12:19:54.778 PDT [2000572] LOG:  recovery finished waiting after 40359.937 ms: recovery conflict on buffer pin
> 2022-07-01 12:19:54.778 PDT [2000572] CONTEXT:  WAL redo at 0/344E098 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 100; blkref #0: rel 1663/16385/1>
> 2022-07-01 12:19:54.788 PDT [2000578] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with recovery
> 2022-07-01 12:19:54.788 PDT [2000578] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
> 2022-07-01 12:19:54.788 PDT [2000578] 031_recovery_conflict.pl HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2022-07-01 12:19:54.837 PDT [2001389] 031_recovery_conflict.pl LOG:  statement: SELECT 1;
>
>   note that the startup process considers the conflict resolved by the time
>   the backend handles the interrupt.

I guess the reason we first get an ERROR and then a FATAL is that the second
iteration hits the if (RecoveryConflictPending && DoingCommandRead) bit in
ProcessInterrupts(), because we end up in DoingCommandRead after handling the
first error? And that path throws FATAL.
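
For reference, the bit I mean looks roughly like this in postgres.c
(paraphrased from memory, not quoted verbatim):

    if (RecoveryConflictPending && DoingCommandRead)
    {
        QueryCancelPending = false; /* this trumps QueryCancel */
        RecoveryConflictPending = false;
        LockErrorCleanup();
        pgstat_report_recovery_conflict(RecoveryConflictReason);
        ereport(FATAL,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("terminating connection due to conflict with recovery"),
                 errdetail_recovery_conflict(),
                 errhint("In a moment you should be able to reconnect to the"
                         " database and repeat your command.")));
    }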

I suspect that Thomas' fix will address at least part of this, as the check
whether we're still waiting for a lock will be made just before the error is
thrown.


>   I also see cases where a FATAL is repeated:
>
> 2022-07-01 12:43:18.190 PDT [2054721] LOG:  recovery still waiting after 15.410 ms: recovery conflict on snapshot
> 2022-07-01 12:43:18.190 PDT [2054721] DETAIL:  Conflicting process: 2054753.
> 2022-07-01 12:43:18.190 PDT [2054721] CONTEXT:  WAL redo at 0/344AB90 for Heap2/PRUNE: latestRemovedXid 732 nredirected 18 ndead 0; blkref #0: rel 1663/16385/>
> 2054753: reporting recovery conflict 9
> 2022-07-01 12:43:18.482 PDT [2054753] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with recovery
> 2022-07-01 12:43:18.482 PDT [2054753] 031_recovery_conflict.pl DETAIL:  User query might have needed to see row versions that must be removed.
> 2022-07-01 12:43:18.482 PDT [2054753] 031_recovery_conflict.pl HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> ...
> 2054753: reporting recovery conflict 9
> 2022-07-01 12:43:19.068 PDT [2054753] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with recovery
> 2022-07-01 12:43:19.068 PDT [2054753] 031_recovery_conflict.pl DETAIL:  User query might have needed to see row versions that must be removed.
> 2022-07-01 12:43:19.068 PDT [2054753] 031_recovery_conflict.pl HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>
>   the FATAL one seems like it might at least partially be due to
>   RecoveryConflictPending not being reset in at least some of the FATAL
>   recovery conflict paths.
>
>   It seems pretty obvious that the proc_exit_inprogress check in
>   RecoveryConflictInterrupt() is misplaced, and needs to be where the errors
>   are thrown. But that won't help, because it turns out that
>   proc_exit_inprogress isn't necessarily set yet at that point.
>
>   Look at this stack from an assertion in ProcessInterrupts() ensuring that
>   the same FATAL isn't raised twice:
>
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
> #1  0x00007fd47897b546 in __GI_abort () at abort.c:79
> #2  0x00005594c150b27a in ExceptionalCondition (conditionName=0x5594c16fe746 "!in_fatal", errorType=0x5594c16fd8f6 "FailedAssertion",
>     fileName=0x5594c16fdac0 "/home/andres/src/postgresql/src/backend/tcop/postgres.c", lineNumber=3259)
>     at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
> #3  0x00005594c134f6d2 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3259
> #4  0x00005594c150c671 in errfinish (filename=0x5594c16b8f2e "pqcomm.c", lineno=1393, funcname=0x5594c16b95e0 <__func__.8> "internal_flush")
>     at /home/andres/src/postgresql/src/backend/utils/error/elog.c:683
> #5  0x00005594c115e059 in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1393
> #6  0x00005594c115df49 in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1340
> #7  0x00005594c15121af in send_message_to_frontend (edata=0x5594c18a5740 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3283
> #8  0x00005594c150f00e in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
> #9  0x00005594c150c42e in errfinish (filename=0x5594c16fdaed "postgres.c", lineno=3266, funcname=0x5594c16ff5b0 <__func__.9> "ProcessInterrupts")
>     at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
> #10 0x00005594c134f770 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3266
> #11 0x00005594c134b995 in ProcessClientReadInterrupt (blocked=true) at /home/andres/src/postgresql/src/backend/tcop/postgres.c:497
> #12 0x00005594c1153417 in secure_read (port=0x5594c2e7d620, ptr=0x5594c189ba60 <PqRecvBuffer>, len=8192)
>
>   reporting a FATAL error in the process of reporting a FATAL error. Yeah.
>
>   I assume this could lead to sending out the same message quite a few
>   times.

This seems like it needs to be fixed in elog.c. ISTM that at the very least we
ought to HOLD_INTERRUPTS() before the EmitErrorReport() for FATAL.
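
In errfinish() terms I'm thinking of something along these lines (a sketch of
the idea only, not a tested patch):

    /* in errfinish(), before the message is sent anywhere */
    if (elevel >= FATAL)
    {
        /*
         * We've decided to die; don't let ProcessInterrupts() throw another
         * FATAL from inside EmitErrorReport() -> socket_flush() while we're
         * still reporting this one.
         */
        HOLD_INTERRUPTS();
    }

    EmitErrorReport();

Since the FATAL path ends in proc_exit() anyway, never releasing that hold
seems harmless.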

Greetings,

Andres Freund

