Hi,

On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote:
> Applying this makes the _cancel test pass, moving the failure instead to the
> following _enable test (which matches what coypu and mylodon are seeing).

FWIW, I'm somewhat annoyed that I'm now spending time debugging this to
get the buildfarm green again.

I'm fairly certain that the bug here is a simple race condition in the
test (not the main code!):

The flag informing whether the worker has started is cleared via an
on_shmem_exit() hook:

    static void
    launcher_exit(int code, Datum arg)
    {
        ChecksumHelperShmem->abort = false;
        pg_atomic_clear_flag(&ChecksumHelperShmem->launcher_started);
    }

but the wait in the test is done via functions like:

    CREATE OR REPLACE FUNCTION test_checksums_on() RETURNS boolean AS $$
    DECLARE
      enabled boolean;
    BEGIN
      LOOP
        SELECT setting = 'on' INTO enabled FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
        IF enabled THEN
          EXIT;
        END IF;
        PERFORM pg_sleep(1);
      END LOOP;
      RETURN enabled;
    END;
    $$ LANGUAGE plpgsql;

    INSERT INTO t1 (b, c) VALUES (generate_series(1,10000), 'starting values');

    CREATE OR REPLACE FUNCTION test_checksums_off() RETURNS boolean AS $$
    DECLARE
      enabled boolean;
    BEGIN
      PERFORM pg_sleep(1);
      SELECT setting = 'off' INTO enabled FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
      RETURN enabled;
    END;
    $$ LANGUAGE plpgsql;

which just wait for setting checksums to have finished. It's
exceedingly unsurprising that a 'pg_sleep(1)' is not a reliable way to
make sure that a process has finished exiting. Follow-up tests then
fail because the process is still running.

Also:

    CREATE OR REPLACE FUNCTION reader_loop() RETURNS boolean AS $$
    DECLARE
      counter integer;
    BEGIN
      FOR counter IN 1..30 LOOP
        PERFORM count(a) FROM t1;
        PERFORM pg_sleep(0.2);
      END LOOP;
      RETURN True;
    END;
    $$ LANGUAGE plpgsql;

Really? Let's just force the test to take at least 6s purely from
sleeping?

Greetings,

Andres Freund