On Thu, Dec 29, 2022 at 9:40 PM Noah Misch <n...@leadboat.com> wrote:
> On Tue, Jul 26, 2022 at 07:22:52PM -0400, Tom Lane wrote:
> > Thomas Munro <thomas.mu...@gmail.com> writes:
> > > ... The regular expression machinery is capable of
> > > consuming a lot of CPU, and does CANCEL_REQUESTED(nfa->v->re)
> > > frequently to avoid getting stuck. With the patch as it stands, that
> > > would never be true.
> >
> > Surely that can't be too hard to fix. We might have to refactor
> > the code around QueryCancelPending a little bit so that callers
> > can ask "do we need a query cancel now?" without actually triggering
> > a longjmp ... but why would that be problematic?
>
> It could work. The problems are like those of making code safe to run in a
> signal handler. You can use e.g. snprintf in rcancelrequested(), but you
> still can't use palloc() or ereport(). I see at least these strategies:
>
> 1. Accept that recovery conflict checks run after a regex call completes.
> 2. Have rcancelrequested() return true unconditionally if we need a conflict
>    check. If there's no actual conflict, restart the regex.
> 3. Have rcancelrequested() run the conflict check, including elog-using
>    PostgreSQL code. On longjmp(), accept the leak of regex mallocs.
> 4. Have rcancelrequested() run the conflict check, including elog-using
>    PostgreSQL code. On longjmp(), escalate to FATAL.
> 5. Write the conflict check code to dutifully avoid longjmp().
> 6. Convert src/backend/regex to use palloc, so longjmp() is fine.
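To make strategy #2 concrete, here is a rough self-contained sketch of the restart-on-spurious-cancel shape. Every name below is invented for illustration; none of this is the real PostgreSQL API, and the real conflict check would run elog-capable code rather than test a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for backend state (illustrative names only). */
static volatile bool conflict_check_pending = false; /* set by a signal handler in reality */
static bool conflict_is_real = false;

/* Strategy #2: the callback reports "cancel" whenever a conflict check is
 * merely *pending*; it runs no elog-capable code itself. */
static int
rcancelrequested(void)
{
    return conflict_check_pending ? 1 : 0;
}

/* Stand-in for a long-running regex match that polls the callback. */
static int
run_regex(void)
{
    for (int step = 0; step < 1000; step++)
    {
        if (rcancelrequested())
            return -1;          /* REG_CANCEL in the real regex code */
        /* ... NFA work ... */
    }
    return 0;                   /* match attempt ran to completion */
}

/* Caller: on cancel, run the real conflict check (which may throw in
 * PostgreSQL); if the cancel was spurious, restart from scratch. */
static int
match_with_restart(void)
{
    for (;;)
    {
        if (run_regex() == 0)
            return 0;
        conflict_check_pending = false;   /* conflict check consumed */
        if (conflict_is_real)
            return -1;                    /* would be ereport(ERROR) for real */
        /* no actual conflict: restart the regex */
    }
}
```

The callback stays as restricted as a signal handler needs it to be, because the elog-capable check only runs after the regex has freed its state and returned control.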
Thanks! I appreciate the help getting unstuck here. I'd thought about some of these but not all. I also considered a couple more:

7. Do a CFI() in a try/catch if INTERRUPTS_PENDING_CONDITION() is true, and
   copy the error somewhere to be re-thrown later after the regexp code exits
   with REG_CANCEL.
8. Do a CFI() in a try/catch if INTERRUPTS_PENDING_CONDITION() is true, and
   call a new regexp function that will free everything before re-throwing.

After Tom's response I spent some time trying to figure out how to make a SOFT_CHECK_FOR_INTERRUPTS(), which would return a value to indicate that it would like to throw. I think it would need to re-arm various flags and introduce a programming rule for all interrupt processing routines: if they fired once under a soft check, they must fire again later under a non-soft check. That all seemed a bit complicated, and a general mechanism like that seemed like overkill for a single user, which led me to idea #7.

Idea #8 is a realisation that twisting oneself into a pretzel to avoid having to change the regexp code or its REG_CANCEL control flow may be a bit silly. If the only thing it really needs to do is free some memory, maybe the regexp module should provide a function that frees everything and is safe to call from our rcancelrequested callback, so we can do so before we longjmp back to Kansas. Then the REG_CANCEL code paths would be effectively unreachable in PostgreSQL. I don't know if it's better or worse than your idea #6, "use palloc instead, it already has garbage collection, duh", but it's a different take on the same realisation that this is just about free().

I guess idea #6 must be pretty easy to try: just point that MALLOC() macro at palloc(), and do a plain old CFI() in rcancelrequested(). Why do you suggest #3 as an interim measure? Do we fear that palloc() might hurt regexp performance?

> I can reproduce that symptom reliably, on GNU/Linux, with the attached patch
> adding sleeps.
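For the record, a toy model of idea #7's control flow, with plain setjmp/longjmp standing in for PG_TRY()/PG_CATCH(). All names here are invented, and the real code would copy the error with something like CopyErrorData() rather than stashing a string pointer:

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

/* Toy model of idea #7: run the interrupt check inside a try/catch, copy
 * the error aside, let the regex unwind cleanly via its cancel path, and
 * re-throw afterwards.  Every name is an invented stand-in. */

static int interrupt_pending = 1;  /* INTERRUPTS_PENDING_CONDITION() stand-in */
static jmp_buf catch_buf;          /* PG_TRY() stand-in */
static const char *saved_error;    /* the "copied" error */

static void
throw_error(const char *msg)       /* ereport(ERROR) stand-in */
{
    saved_error = msg;
    longjmp(catch_buf, 1);
}

static void
check_for_interrupts(void)         /* CFI() stand-in */
{
    if (interrupt_pending)
        throw_error("canceling statement due to conflict with recovery");
}

/* rcancelrequested(): try the check; if it throws, keep the copied error
 * and report cancellation so the regex can free its mallocs and return. */
static int
rcancelrequested(void)
{
    if (!interrupt_pending)
        return 0;
    if (setjmp(catch_buf) == 0)    /* "try" */
    {
        check_for_interrupts();
        return 0;                  /* check fired without erroring */
    }
    return 1;                      /* "catch": report REG_CANCEL, error saved */
}

/* Caller: if the regex comes back cancelled, re-throw the saved error. */
static const char *
run_regex_then_rethrow(void)
{
    if (rcancelrequested())        /* polled from inside the regex loop */
        return saved_error;        /* would be re-thrown for real */
    return NULL;                   /* match ran to completion */
}
```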
> The key log bit:
>
> 2022-09-16 11:50:37.338 CEST [15022:4] 031_recovery_conflict.pl LOG:  statement: BEGIN;
> 2022-09-16 11:50:37.339 CEST [15022:5] 031_recovery_conflict.pl LOG:  statement: LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE;
> 2022-09-16 11:50:37.341 CEST [15022:6] 031_recovery_conflict.pl LOG:  statement: SELECT 1;
> 2022-09-16 11:50:38.076 CEST [14880:17] LOG:  recovery still waiting after 11.482 ms: recovery conflict on lock
> 2022-09-16 11:50:38.076 CEST [14880:18] DETAIL:  Conflicting process: 15022.
> 2022-09-16 11:50:38.076 CEST [14880:19] CONTEXT:  WAL redo at 0/34243F0 for Standby/LOCK: xid 733 db 16385 rel 16386
> 2022-09-16 11:50:38.196 CEST [15022:7] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with recovery
> 2022-09-16 11:50:38.196 CEST [15022:8] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
> 2022-09-16 11:50:38.196 CEST [15022:9] 031_recovery_conflict.pl HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2022-09-16 11:50:38.197 CEST [15022:10] 031_recovery_conflict.pl LOG:  disconnection: session time: 0:00:01.041 user=nm database=test_db host=[local]
> 2022-09-16 11:50:38.198 CEST [14880:20] LOG:  recovery finished waiting after 132.886 ms: recovery conflict on lock
>
> The second DETAIL should be "User was holding a relation lock for too long."
> The backend in question is idle in transaction. RecoveryConflictInterrupt()
> for PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK won't see IsWaitingForLock(),
> so it will find no conflict. However, RecoveryConflictReason remains
> clobbered, hence the wrong DETAIL message.

Aha. I'd speculated up-thread that RecoveryConflictReason must be capable of reporting bogus errors like that.

> Incidentally, the affected test contains comment "# DROP TABLE containing
> block which standby has in a pinned buffer".
> The standby holds no pin at that moment; the LOCK TABLE pins system
> catalog pages, but it drops every pin it acquires.

Oh, I guess the comment is just wrong? There are earlier sections concerned with buffer pins, but the section "RECOVERY CONFLICT 3" is about locks.
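The clobbered-global hazard Noah describes can be sketched in miniature. This is a deliberately simplified model, not the real RecoveryConflictInterrupt() logic; the enum values, function names, and flags are all invented:

```c
#include <assert.h>
#include <string.h>

/* Toy model: the interrupt handler records the latest conflict reason even
 * when it then decides there is no conflict, so a later termination that
 * was already scheduled reports a stale reason.  Illustrative names only. */

typedef enum
{
    CONFLICT_NONE,
    CONFLICT_LOCK,                /* "holding a relation lock for too long" */
    CONFLICT_BUFFERPIN_DEADLOCK   /* "buffer deadlock with recovery" */
} ConflictReason;

static ConflictReason recovery_conflict_reason = CONFLICT_NONE;
static int termination_scheduled = 0;

/* RecoveryConflictInterrupt() stand-in: clobbers the reason up front. */
static void
recovery_conflict_interrupt(ConflictReason reason, int waiting_for_lock)
{
    recovery_conflict_reason = reason;       /* clobbered unconditionally */
    if (reason == CONFLICT_BUFFERPIN_DEADLOCK && !waiting_for_lock)
        return;                              /* "no conflict found" -- but the
                                              * reason is never reset */
    termination_scheduled = 1;
}

/* The DETAIL chosen when the scheduled FATAL finally fires. */
static const char *
detail_message(void)
{
    switch (recovery_conflict_reason)
    {
        case CONFLICT_LOCK:
            return "User was holding a relation lock for too long.";
        case CONFLICT_BUFFERPIN_DEADLOCK:
            return "User transaction caused buffer deadlock with recovery.";
        default:
            return "unknown";
    }
}
```

Running the two interrupts in the order seen in the log (lock conflict first, then the deadlock check that finds no conflict) leaves the deadlock DETAIL in place for the lock-conflict FATAL, matching the misleading message above.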