It looks like we are unnecessarily instructing our users to vacuum their databases in single-user mode when just vacuuming would be enough.
We should fix the error message to be less misleading.

== The story

I think most of us have at some point seen the following message, if not in our own database, then at some client's:

ERROR:  database is not accepting commands to avoid wraparound data loss in database "<dbname>"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

But "vacuum that database in single-user mode" is the very last thing one wants to do, because:

* it is single-user mode, so nothing else works
* CHECKPOINTs are not running, so WAL segments cannot be rotated and reused
* replication does not work, so after the vacuum is done and the database is started in normal mode, there is a huge backlog to replicate
* pg_stat_progress_vacuum is not available, so you have no idea when the command is going to complete
* VACUUM VERBOSE isn't verbose -- there is absolutely no output from a single-user-mode vacuum, with or without VERBOSE, so you *really* do not know what is going on or how much progress has been made (if you are lucky you can guess something from I/O and CPU monitoring, but that is inexact at best)

When I started looking at improving the situation, I discovered that there is already no need to run VACUUM in single-user mode in any currently supported PostgreSQL version, as you can run VACUUM perfectly well while the wraparound protection is active. It worked in all PG versions from v9.6 to v13. I also tested v8.3, as that is where we added virtual transactions, but there VACUUM really does fail to run successfully without single-user mode.

So my proposal is to change the error message [*] to something that does not suggest single-user mode as a requirement for running VACUUM.

Also, COMMIT PREPARED still works fine in this situation.

Single-user mode may still be needed in case one needs to drop a replication slot or something similar.
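As a side note (not part of the proposal itself): you can check from any normal session how far a cluster is from the protection kicking in. The 1M/10M margins in the comment below are my reading of SetTransactionIdLimit() in varsup.c, so treat them as approximate:

```sql
-- Sketch: distance to the wraparound stop limit, per database.
-- As I read varsup.c, commands are refused when the oldest xid's age
-- reaches roughly 2^31 minus 1 million, with WARNINGs starting about
-- 10 million xids before that (hence the ~10M warnings mentioned
-- in the test instructions).
SELECT datname,
       age(datfrozenxid) AS xid_age,
       (2^31 - 1000000)::bigint - age(datfrozenxid) AS xids_until_stop
  FROM pg_database
 ORDER BY xid_age DESC;
```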
[*] The message is in src/backend/access/transam/varsup.c around line 120

=== How to test

The following instructions let you run into the wraparound protection in about an hour, depending on your setup (it was 1.2 hours on my laptop).

==== First, set some flags

To allow PREPARE TRANSACTION to block VACUUM cleanup:

```
alter system set max_prepared_transactions = 10;
```

Also set *_min_messages to error, unless you want to get 10M WARNINGs (~4GB) in the logs and the same amount sent to the client, slowing down the last 10M transactions significantly:

```
alter system set log_min_messages = error;
alter system set client_min_messages = error;
```

==== Restart the system to activate the settings

==== Block VACUUM from cleaning up transactions

Create a database `wraptest` and connect to it, then:

```
create table t (i int);
BEGIN;
insert into t values(1);
PREPARE TRANSACTION 'trx_id_pin';
```

Now you have a prepared transaction, which makes sure that even a well-tuned autovacuum does not prevent running into the wraparound protection.
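A side query, not part of the instructions: once the pgbench load below is running, the effect of the pin can be watched from another session, because the prepared transaction keeps datfrozenxid from advancing:

```sql
-- Watch the xid age climb while the load burns through transaction ids;
-- autovacuum cannot advance datfrozenxid past the prepared transaction.
SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
 ORDER BY xid_age DESC;
```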
```
[local]:5096 hannu@wraptest=# SELECT * FROM pg_prepared_xacts;
 transaction |    gid     |           prepared            | owner | database
-------------+------------+-------------------------------+-------+----------
         593 | trx_id_pin | 2021-03-01 08:57:27.024777+01 | hannu | wraptest
(1 row)
```

==== Create a function to consume transaction ids as fast as possible

```
CREATE OR REPLACE FUNCTION trx_eater(n int) RETURNS void LANGUAGE plpgsql AS $plpgsql$
BEGIN
  FOR i IN 0..n LOOP
    BEGIN
      INSERT INTO t VALUES (i);
    EXCEPTION WHEN OTHERS THEN
      RAISE;  -- raise it again, so that we actually err out on wraparound
    END;
  END LOOP;
END;
$plpgsql$;
```

==== Use pgbench to drive this function

Make a pgbench command file:

$ echo 'select trx_eater(100000);' > trx_eater.pgbench

and start pgbench to run this function in a few backends in parallel:

$ pgbench -c 16 -T 20000 -P 60 -n wraptest -f trx_eater.pgbench

==== Wait 1-2 hours

In about an hour or two this should error out with:

ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

==== Verify that VACUUM still works

After this, just do

COMMIT PREPARED 'trx_id_pin';

to release the blocked 2PC transaction, and you can verify for yourself that

* you can run VACUUM on any table, and
* autovacuum is working and will eventually clear up the situation.

If you have not tuned autovacuum_vacuum_cost_* at all, the autovacuum-started vacuum runs really slowly -- especially in earlier versions, where autovacuum_vacuum_cost_delay is 20ms by default -- and it will take about 8 hours to clean up the table. This can be sped up by setting autovacuum_vacuum_cost_delay = 0 and then either restarting the database or just killing the vacuum process after reloading the config. After this it should complete in 15-30 minutes, after which the database is available for writes again.

Cheers,
Hannu Krosing