Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Andres Freund Tue, 22 Feb 2022 01:11:44 -0800

Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
> 
> Yes.


I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=10000

these are just to give more time / to require smaller tables.


Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 
4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior 
dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE 
pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr 
handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the 
wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE 
pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│ 1     │  2117 │
│ 2     │    67 │
└───────┴───────┘
(2 rows)


oops.


I don't yet have a good idea how to tackle this in the backbranches.



I've started to work on a few debugging aids to find problem like
these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
   deleted files. That seems to work (almost) everywhere. Optionally, on
   linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
   /proc/$pid/fd/ to check for deleted files. This only can be done after
   we've just made sure there's no fds open to deleted files.

Greetings,

Andres Freund

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Reply via email to