Hi, On 2022-02-10 14:26:59 -0800, Andres Freund wrote: > On 2022-02-11 09:10:38 +1300, Thomas Munro wrote: > > It seems like I should go ahead and do that today, and we can study > > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work? > > Yes.
I wrote a test to show the problem. While looking at the code I unfortunately found that CREATE DATABASE ... OID isn't the only problem. Two consecutive ALTER DATABASE ... SET TABLESPACE also can cause corruption. The test doesn't work on < 15 (due to PostgresNode renaming and use of allow_in_place_tablespaces). But manually it's unfortunately quite reproducible :( Start a server with shared_buffers = 1MB autovacuum=off bgwriter_lru_maxpages=1 bgwriter_delay=10000 these are just to give more time / to require smaller tables. Start two psql sessions s1: \c postgres s1: CREATE DATABASE conflict_db; s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg'; s1: CREATE EXTENSION pg_prewarm; s2: \c conflict_db s2: CREATE TABLE large(id serial primary key, dataa text, datab text); s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i); s2: UPDATE large SET datab = 1; -- prevent AtEOXact_SMgr s1: BEGIN; -- Fill shared_buffers with other contents. This needs to write out the prior dirty contents -- leading to opening smgr rels / file descriptors s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0; s2: \c postgres -- unlinks the files s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace; -- now new files with the same relfilenode exist s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default; s2: \c conflict_db -- dirty buffers again s2: UPDATE large SET datab = 2; -- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed, -- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0; -- verify that everything is [not] ok s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10; ┌───────┬───────┐ │ datab │ count │ ├───────┼───────┤ │ 1 │ 2117 │ │ 2 │ 67 │ └───────┴───────┘ (2 rows) oops. I don't yet have a good idea how to tackle this in the backbranches. I've started to work on a few debugging aids to find problem like these. Attached are two WIP patches: 1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with deleted files. That seems to work (almost) everywhere. Optionally, on linux, use readlink(/proc/$pid/fd/$fd) to get the filename. 2) On linux, iterate over PGPROC to get a list of pids. Then iterate over /proc/$pid/fd/ to check for deleted files. This only can be done after we've just made sure there's no fds open to deleted files. Greetings, Andres Freund