On Wed, 13 Mar 2024 at 04:53, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I suspect it's basically just a
> timing dependency. Have you thought about the fact that a cancel
> request is a no-op if it arrives after the query's done?

I agree it's probably a timing issue. The cancel being received after the query is done seems very unlikely, since the query takes 180 seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these animals). I think it's more likely that the cancel request arrives too early and is thus ignored because no query is running yet.

The test already had logic to wait until the query backend was in the "active" state before sending a cancel, to solve that issue. But my guess is that that somehow isn't enough.

Sadly I'm having a hard time reliably reproducing this race condition locally, so it's hard to be sure what is happening here. Attached is a patch with a wild guess as to what the issue might be (i.e. seeing an outdated "active" state and thus passing the check even though the query is not running yet).
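
To make the suspected race concrete, here is a toy model of it in Python. This is only a sketch of my guess, not PostgreSQL or libpq code: ToyBackend, naive_ready, strict_ready and scenario are all made-up names, and the "stricter" check (also comparing the reported query text) is just one hypothetical way to avoid trusting a stale "active" state, not necessarily what the attached patch does.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyBackend:
    """Hypothetical stand-in for a backend; not real PostgreSQL behaviour."""
    reported_state: str = "idle"     # what a monitoring view would show
    reported_query: str = ""         # query text shown alongside the state
    running: Optional[str] = None    # query actually executing, if any
    cancelled: bool = False

    def leave_stale_report(self, query: str) -> None:
        # An earlier query has finished, but (by assumption) the reported
        # state still says "active" for a short while.
        self.reported_state, self.reported_query = "active", query

    def begin(self, query: str) -> None:
        self.running = query
        self.reported_state, self.reported_query = "active", query

    def cancel(self) -> bool:
        # A cancel is a no-op when no query is running.
        if self.running is None:
            return False
        self.running = None
        self.cancelled = True
        self.reported_state = "idle"
        return True


def naive_ready(backend: ToyBackend, expected: str) -> bool:
    # Only looks at the state, so a stale "active" passes the check.
    return backend.reported_state == "active"


def strict_ready(backend: ToyBackend, expected: str) -> bool:
    # Also requires that the reported query is the one we want to cancel.
    return backend.reported_state == "active" and backend.reported_query == expected


def scenario(ready: Callable[[ToyBackend, str], bool]) -> bool:
    """Return True if the long query ends up cancelled."""
    long_query = "SELECT pg_sleep(180)"
    backend = ToyBackend()
    backend.leave_stale_report("SELECT 1")

    # The client has sent long_query, but the backend has not started it yet.
    cancel_sent = False
    if ready(backend, long_query):
        backend.cancel()             # arrives too early: ignored
        cancel_sent = True

    backend.begin(long_query)        # now the long query really starts

    # A client that has not yet sent its cancel checks again and sends it.
    if not cancel_sent and ready(backend, long_query):
        backend.cancel()
    return backend.cancelled


if __name__ == "__main__":
    print("naive check cancels the query: ", scenario(naive_ready))
    print("strict check cancels the query:", scenario(strict_ready))
```

In this model the naive check sends its one cancel before the long query has started, so the query then runs for its full duration, which would look like the failures on the buildfarm. Whether the real failure follows this sequence is exactly the part I can't confirm without a reliable reproduction.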
v37-0001-Hopefully-make-cancel-test-more-reliable.patch
v37-0002-Start-using-new-libpq-cancel-APIs.patch