Hi Michael,

> The scans may be quite long as well, actually, which could be a
> bottleneck.  Did you measure the runtime with a maximized (still
> realistic) pool of files for these SLRUs in the upgrade time?  For
> upgrades, data would be the neck.

Good question.

In theory SLRUs are not supposed to grow large, and their size is a
small fraction of the rest of the database. As an example, CLOG
(pg_xact/) stores 2 bits per transaction. Since every SLRU has a
dedicated directory and we scan only that directory, non-SLRU files
don't affect the scan time.
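
To put that in numbers, each SLRU segment is SLRU_PAGES_PER_SEGMENT
(32) pages of BLCKSZ (8192) bytes, so a single CLOG segment covers
about a million transactions. A back-of-the-envelope check, assuming
the default build options:

```
# Back-of-the-envelope: transactions covered by one CLOG segment,
# assuming default BLCKSZ = 8192 and SLRU_PAGES_PER_SEGMENT = 32.
my $blcksz   = 8192;                 # bytes per page
my $pages    = 32;                   # pages per SLRU segment
my $bits     = $blcksz * $pages * 8; # bits per segment
my $per_xact = 2;                    # CLOG stores 2 bits per transaction
my $xacts    = $bits / $per_xact;
print "one pg_xact segment covers $xacts transactions\n"; # 1048576
```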

To make sure, I asked several people to check how many SLRU segments
they have in their prod environments. A typical response looked like
this:

```
$PGDATA/pg_xact: 191 segments
$PGDATA/pg_commit_ts: 3
$PGDATA/pg_multixact/offsets: 148
$PGDATA/pg_multixact/members: 400
$PGDATA/pg_subtrans: 4
$PGDATA/pg_serial: 3
```
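
For what it's worth, counts like the above can be gathered with a
small script along these lines (the directory list and the default
$PGDATA path are illustrative, not prescriptive):

```
# Count segment files in each SLRU directory under $PGDATA.
# The fallback path below is just an example; adjust to your setup.
my $pgdata = $ENV{PGDATA} // '/var/lib/postgresql/data';
for my $dir (qw(pg_xact pg_commit_ts pg_multixact/offsets
                pg_multixact/members pg_subtrans pg_serial)) {
    my @segs = glob("$pgdata/$dir/*");
    printf "%s: %d segments\n", $dir, scalar @segs;
}
```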

This is an 800 GB database. Interestingly, larger databases (4.2 TB)
may have far fewer SLRU segments (220 in total, most of them in
pg_xact).

And here is the *worst* case that was reported to me:

```
$PGDATA/pg_xact: 171 segments
$PGDATA/pg_commit_ts: 3
$PGDATA/pg_multixact/offsets: 4864
$PGDATA/pg_multixact/members: 40996
$PGDATA/pg_subtrans: 5
$PGDATA/pg_serial: 3
```

I was told this is a "1TB+" database. For this user pg_upgrade will
rename about 45,000 files. I wrote a little script to check how long
that would take:

```
#!/usr/bin/env perl

use strict;
use warnings;

my $from = "test_0001.tmp";
my $to   = "test_0002.tmp";

# Create the initial file, then rename it back and forth 45,000 times.
system("touch $from");

for my $i (1..45000) {
    rename($from, $to) or die "rename failed: $!";
    ($from, $to) = ($to, $from);
}

unlink($from); # clean up the temporary file
```

On my laptop this takes 0.5 seconds. Note that I don't do any
scanning, only renaming, assuming that the renames should take most of
the time. I think this should be multiplied by 10 to account for the
filesystem cache and other factors.

All in all, even in the absolute worst-case scenario this shouldn't
take more than 5 seconds; in reality it will probably be orders of
magnitude less.

> Note that this also depends on the system endianness, see 039_end_of_wal.pl.

Sure, I think I took it into account when using pack("L!"). My
understanding is that "L" takes care of the endianness, since I see
special flags to force little- or big-endianness independently of the
platform [1]. This should of course be tested in practice on different
machines. Using an exclamation mark in "L!" was a mistake, since
cat_ver is not an int but rather a uint32.
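
For reference, here is how the pack templates differ; the byte count
for "L!" depends on the size of the platform's unsigned long, which is
exactly why it is the wrong choice for a uint32 field:

```
# "L"  : exactly 32 bits, native byte order
# "L<" : exactly 32 bits, forced little-endian
# "L>" : exactly 32 bits, forced big-endian
# "L!" : the platform's native unsigned long -- often 8 bytes on
#        64-bit Unix systems, hence unsuitable for a uint32
my $v = 0x01020304;
printf "L  -> %d bytes\n", length(pack("L", $v));
printf "L< -> %s\n", unpack("H*", pack("L<", $v)); # 04030201
printf "L> -> %s\n", unpack("H*", pack("L>", $v)); # 01020304
printf "L! -> %d bytes on this machine\n", length(pack("L!", $v));
```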

> You don't really need the lookup part, actually?

For lookup we already have the pg_controldata tool, that's not a problem.

> Control file manipulation may be useful as a routine in Cluster.pm,
> based on an offset in the file and a format to pack as argument?
> [...]
> It's one of these things I could see myself reuse to force a state in
> the cluster and make a test cheaper, for example.

> You would just need the part where
> the control file is rewritten, which should be OK as long as the
> cluster is freshly initdb'd meaning that there should be nothing that
> interacts with the new value set.

Agreed. Still, I don't see a good way of figuring out
sizeof(ControlFileData) from Perl. The structure has ints in it (e.g.
wal_level, MaxConnections, etc.), so its size is platform-dependent.
The CRC should be placed at the end of the structure. If we want to
manipulate MaxConnections etc., their offsets are going to be
platform-dependent as well. And my understanding is that the alignment
is platform/compiler-dependent too.

I guess we are going to need either a `pg_writecontroldata` tool or a
`pg_controldata -w` flag. I wonder which option you find more
attractive, or maybe you have better ideas?

[1]: https://perldoc.perl.org/functions/pack

-- 
Best regards,
Aleksander Alekseev
