On Sat, Aug 25, 2018 at 12:16 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deola...@gmail.com> writes:
> > 1. The user soon found out that they can no longer connect to any database
> > in the cluster. Not just the one to which the affected table belonged, but
> > no other database in the cluster. The affected table is a regular user
> > table (actually a toast table).
>
> Please define "can no longer connect".  What happened *exactly*?
> How long did it take to start failing like that (was this perhaps a
> shutdown-because-of-impending-wraparound situation)?

The errors were simply about the missing file. See the attached
reproduction script that I created while studying this complaint. It will
throw errors such as:

psql: FATAL:  could not open file "base/12669/16387": No such file or directory
CONTEXT:  writing block 207724 of relation base/12669/16387

Now of course, the file really is missing. But the user was quite surprised
that they couldn't connect to any database, even though the mishap happened
to a user table in one of their reporting databases. To add to their misery,
while a restart fixed this error and opened their other databases for
regular operation, it caused the toast corruption. (The original report, of
course, complained about whichever segment was missing.)

> > 2. So they restarted the database server. While that fixed the connection
> > problem, they started seeing toast errors on the table to which the missing
> > file belonged to. The missing file was recreated at the database restart,
> > but of course it was filled in with all zeroes, causing data corruption.
>
> Doesn't seem exactly surprising, if some toast data went missing.

My concern is about recreating a zero-filled file without even a warning.
Is that OK? Is that necessary to deal with some common scenario?

> > 3. To make things worse, the corruption then got propagated to the standbys
> > too. We don't know if the original file removal was replicated to the
> > standby, but it seems unlikely.
>
> This is certainly unsurprising.

Again, my worry is that we might have corrupted an otherwise good standby by
recreating a zero-filled file and later inserting new data into those blocks.
I wonder if we could have prevented that by requiring administrative
intervention, instead of happily recreating the file and then overwriting it.

> > I've a test case that reproduce all of these effects if a backend file is
> > forcefully removed,
>
> Let's see it.

Attached.

> Note that this:
>
> > WARNING:  could not write block 27094010 of base/56972584/56980980
> > DETAIL:  Multiple failures --- write error might be permanent.
> > ERROR:  could not open file "base/56972584/56980980.69" (target block
> > 27094010): previous segment is only 12641 blocks
> > CONTEXT:  writing block 27094010 of relation base/56972584/56980980
>
> does not say that the .69 file is missing.  It says that .68 (or, maybe,
> some even-earlier segment) was smaller than 1GB, which is a different
> matter.  Still data corruption, but I don't think I believe it was a
> stray "rm".

Hmm, interesting. It's a fairly old report, but somehow I remember doing the
analysis and concluding that an entire segment went missing, not just some
blocks in an intermediate segment. I might be wrong though. Will recheck.

> Oh, and what PG version are we talking about?

I think this is reproducible on all versions I have tested so far,
including master.

Thanks,
Pavan

--
Pavan Deolasee                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
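As a quick sanity check on the numbers in that quoted error (assuming the
default 8 kB block size and 1 GB segment size, neither of which is stated in
the thread): a full segment holds 131072 blocks, so a 12641-block segment is
only about 99 MB, and block 27094010 would actually live in segment .206.
That is consistent with the storage manager walking forward through the
segments and stopping at a short intermediate one, which matches Tom's
reading.

-- back-of-the-envelope arithmetic; assumes 8 kB blocks and 1 GB segments
SELECT 1073741824 / 8192   AS blocks_per_full_segment,      -- 131072
       12641 * 8192        AS short_segment_bytes,          -- ~99 MB
       27094010 / 131072   AS segment_holding_target_block; -- 206, well past .69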
export PGDATA=$HOME/DATA/repro
export PGPORT=5444
export PGDATABASE=postgres

initdb -D $PGDATA
rm -f logfile
pg_ctl -D $PGDATA -o "-c port=$PGPORT" -l logfile -w start

## create the table and insert large number of rows so that we hit the next
## segment
psql -c "CREATE TABLE testtab_toast (a text)"
psql -c "INSERT INTO testtab_toast SELECT (SELECT string_agg('', md5(random()::text)) FROM generate_series(1,1000)) FROM generate_series(1,50000);"

TOASTTABLE=$(psql -tA -c "SELECT 'pg_toast.pg_toast_' || oid FROM pg_class WHERE relname = 'testtab_toast'")
SEGFILE=$(psql -tA -c "SELECT pg_relation_filepath('$TOASTTABLE');")

## and now move the old segment to /tmp, while concurrently inserting more rows
psql -c "INSERT INTO testtab_toast SELECT (SELECT string_agg('', md5(random()::text)) FROM generate_series(1,1000)) FROM generate_series(1,50000);"&
sleep 5
mv $PGDATA/$SEGFILE /tmp
sleep 5

## check if connecting to the database (even non-existent) fails
psql postgres -c "SELECT 1"
psql testdb -c "SELECT 1"

## stop and restart the server
pg_ctl -D $PGDATA stop -w
pg_ctl -D $PGDATA -l logfile start -w

## fail?
psql -c "SELECT a FROM testtab_toast LIMIT 1"

# does REINDEX fix it?
psql -c "REINDEX TABLE $TOASTTABLE"
psql -c "SELECT a FROM testtab_toast LIMIT 1"

# does VACUUM FULL fix it?
psql -c "VACUUM FULL $TOASTTABLE"
psql -c "SELECT a FROM testtab_toast LIMIT 1"

pg_ctl -D $PGDATA stop -w
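One further check that could be slotted in just before the final pg_ctl stop
(it is not part of the original attachment; the DO block below is only an
illustrative sketch): detoast every row of testtab_toast individually and
report the ones that fail, to count how many rows actually lost their TOAST
data after the restart.

## OPTIONAL: count rows whose TOAST data can no longer be detoasted
psql <<'EOF'
DO $$
DECLARE
    r   record;
    bad int := 0;
BEGIN
    -- force detoasting row by row; rows whose chunks lived in the
    -- zero-filled segment should raise "missing chunk number ..." errors
    FOR r IN SELECT ctid FROM testtab_toast LOOP
        BEGIN
            PERFORM length(a) FROM testtab_toast WHERE ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            bad := bad + 1;
            RAISE NOTICE 'damaged row at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
    RAISE NOTICE '% damaged rows found', bad;
END
$$;
EOF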