On Mon, Oct 21, 2024 at 11:13:59AM +0900, Michael Paquier wrote: > Used "basic" as test name at the end, tweaked the comment close to > what you have suggested, and applied the result.
I have spotted a race condition with this test in some CF bot runs. Here are some of them: https://github.com/michaelpq/postgres/runs/31984161747 https://cirrus-ci.com/task/6741385749987328 https://cirrus-ci.com/task/6056918689513472 https://cirrus-ci.com/task/5863397261049856 https://cirrus-ci.com/task/5565235765968896 https://cirrus-ci.com/task/6238594598174720 All of them show up as follows in regression.diffs: diff -U3 /tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out /tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out --- /tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out 2024-10-24 02:36:24.068590000 +0000 +++ /tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out 2024-10-24 02:40:09.179695000 +0000 @@ -13,16 +13,16 @@ (1 row) -step wait1: <... completed> -injection_points_run --------------------- - -(1 row) - step detach2: SELECT injection_points_detach('injection-points-wait'); injection_points_detach ----------------------- +(1 row) + +step wait1: <... completed> +injection_points_run +-------------------- + (1 row) This is in the first permutation of the test done with "wait1 wakeup2 detach2", and the diff means that the backend running the "wait" callback is reported as finished after the detach is done, injection_points_run being only used for the waits. Hence the wait is so slow to finish that the detach has time to complete and finish, breaking the output. And here comes the puzzling part: all of failures involve FreeBSD 13 in the CI. Reproducing this failure would not be difficult, I thought first; we can add a hardcoded pg_usleep() to delay the end of injection_wait() so as we make sure that the backend doing the wait reports later than the detach. Or just do the same at the end of injection_points_run() once the callback exits. I've sure done that, placing some strategic pg_usleep() calls on Linux to make the paths that matter in the wait slower, but the test remains stable. The CI on Linux is stable as well: 3rd and 4th columns of the cfbot are green, I did not spot any failures related to this isolation test in injection_points. Only the second column about FreeBSD is going rogue on a periodic basis. One fix would is to remove this first permutation test, still that's hiding a problem rather than solving it, and something looks wrong with conditional variables, specific to FreeBSD? Hence, I'm puzzled by that. I am going to remove the first permutation anyway to stabilize the CI reports. But it feels like we have a different problem here, and I am not sure what. Any thoughts or comments? -- Michael
signature.asc
Description: PGP signature