Bernd and I have been chasing a bug that happens when all of the
following conditions are fulfilled:

* PG 15..18 (older PGs are ok)
* gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
* arm64 running on Graviton (AWS EC2 c8g.2xlarge; ok on a different arm64 host)
* -O2 (ok with -O0)
* --with-openssl (ok without openssl)
* using no -march flag, or using -march=armv8.4-a (using -march=armv9-a fixes it)

The problem happens early during initdb:

$ ./configure --with-openssl --enable-debug
...
$ /usr/local/pgsql/bin/initdb -D broken --no-clean
...
running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  control file contains invalid database cluster state
child process exited with exit code 1
initdb: data directory "broken" not removed at user's request

Looking at the control file, we can see that the cluster state is
"starting up":

$ /usr/local/pgsql/bin/pg_controldata broken/
pg_control version number:            1700
Catalog version number:               202501101
Database system identifier:           7459462110308027428
Database cluster state:               starting up
pg_control last modified:             Mon 13 Jan 2025 06:02:44 PM UTC
Latest checkpoint location:           0/1000028
Latest checkpoint's REDO location:    0/1000028
Latest checkpoint's REDO WAL file:    000000010000000000000001
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3

The relevant code is in BootStrapXLOG():

    /* Now create pg_control */
    InitControlFile(sysidentifier, data_checksum_version);
    ControlFile->time = checkPoint.time;
    ControlFile->checkPoint = checkPoint.redo;
    ControlFile->checkPointCopy = checkPoint;

    /* some additional ControlFile fields are set in WriteControlFile() */
    WriteControlFile();

and InitControlFile():

    if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
        ereport(PANIC,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("could not generate secret authorization token")));

    memset(ControlFile, 0, sizeof(ControlFileData));
    /* Initialize pg_control status fields */
    ControlFile->system_identifier = sysidentifier;
    memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce,
           MOCK_AUTH_NONCE_LEN);
    ControlFile->state = DB_SHUTDOWNED;

So the state should actually be DB_SHUTDOWNED (1), but on this system,
the value is DB_STARTUP (0).
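
To spell out why a missing store shows up as DB_STARTUP specifically:
the memset() zeroes the whole struct, and 0 is DB_STARTUP. Here is a
minimal sketch of that effect. Everything prefixed Mock/mock_ is made up
for illustration and is not PostgreSQL code; the store_state flag merely
simulates the store being lost.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; only the two enum values mirror the ones
 * quoted above (DB_STARTUP = 0, DB_SHUTDOWNED = 1). */
typedef enum { MOCK_DB_STARTUP = 0, MOCK_DB_SHUTDOWNED = 1 } MockDBState;

typedef struct
{
    unsigned long long system_identifier;
    MockDBState        state;
} MockControlFile;

/* Mimics the shape of InitControlFile(): zero the struct, then set fields.
 * store_state = 0 simulates the state store never taking effect. */
static MockDBState
mock_init_control_file(MockControlFile *cf, int store_state)
{
    memset(cf, 0, sizeof(*cf));
    cf->system_identifier = 42;     /* arbitrary illustrative value */
    if (store_state)
        cf->state = MOCK_DB_SHUTDOWNED;
    return cf->state;
}

/* Self-check: with the store we get SHUTDOWNED, without it STARTUP. */
static void
mock_demo(void)
{
    MockControlFile cf;

    assert(mock_init_control_file(&cf, 1) == MOCK_DB_SHUTDOWNED);
    assert(mock_init_control_file(&cf, 0) == MOCK_DB_STARTUP);
}
```

So a state of "starting up" in pg_control is what one would expect if
either the store is skipped or the value being stored happens to be 0.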

Stepping through InitControlFile, we can see that ControlFile->state is
never written to. (The trace keeps jumping back to the InitControlFile()
call at line 5175 because the function seems to be inlined into
BootStrapXLOG):

$ cd broken
$ rm -f global/pg_control; PGDATA=$PWD gdb /usr/lib/postgresql/17/bin/postgres
Reading symbols from /usr/local/pgsql/bin/postgres...
(gdb) b InitControlFile
Breakpoint 1 at 0x1819bc: file xlog.c, line 4214.
(gdb) r --boot -F -c log_checkpoints=false -X 16777216 -k
Starting program: /usr/local/pgsql/bin/postgres --boot -F -c 
log_checkpoints=false -X 16777216 -k

Breakpoint 1, 0x0000aaaaaac219bc in InitControlFile (sysidentifier=<optimized 
out>, data_checksum_version=<optimized out>) at xlog.c:4214
4214            if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
(gdb) s
5175            InitControlFile(sysidentifier, data_checksum_version);
(gdb)
InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at 
xlog.c:4214
4214            if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
(gdb)
pg_strong_random (buf=buf@entry=0xfffffffff670, len=len@entry=32) at 
pg_strong_random.c:79
79              for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
(gdb)
81                      if (RAND_status() == 1)
(gdb)
87                      RAND_poll();
(gdb)
90              if (RAND_bytes(buf, len) == 1)
(gdb)
InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at 
xlog.c:4219
4219            memset(ControlFile, 0, sizeof(ControlFileData));
(gdb)
4221            ControlFile->system_identifier = sysidentifier;
(gdb)
5175            InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a08 in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4221
4221            ControlFile->system_identifier = sysidentifier;
(gdb)
4222            memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, 
MOCK_AUTH_NONCE_LEN);
(gdb)
5175            InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a3c in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4234
4234            ControlFile->track_commit_timestamp = track_commit_timestamp;
(gdb)
5175            InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a4c in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4232
4232            ControlFile->wal_level = wal_level;
(gdb)
5175            InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a6c in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4233
4233            ControlFile->wal_log_hints = wal_log_hints;
(gdb)
5175            InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a8c in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=<optimized out>) at xlog.c:4233
4233            ControlFile->wal_log_hints = wal_log_hints;
(gdb)
BootStrapXLOG (data_checksum_version=data_checksum_version@entry=1) at 
xlog.c:5181
5181            WriteControlFile();

(gdb) p *ControlFile
$1 = {system_identifier = 7459466832287685723, pg_control_version = 1700, 
catalog_version_no = 202501101, state = DB_STARTUP, time = 1736792463,
  checkPoint = 16777256, checkPointCopy = {redo = 16777256, ThisTimeLineID = 1, 
PrevTimeLineID = 1, fullPageWrites = true, wal_level = 1, nextXid = {
      value = 3}, nextOid = 10000, nextMulti = 1, nextMultiOffset = 0, 
oldestXid = 3, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1,
    time = 1736792463, oldestCommitTsXid = 0, newestCommitTsXid = 0, 
oldestActiveXid = 0}, unloggedLSN = 1000, minRecoveryPoint = 0,
  minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, 
backupEndRequired = false, wal_level = 1, wal_log_hints = false, MaxConnections 
= 100,
  max_worker_processes = 8, max_wal_senders = 10, max_prepared_xacts = 0, 
max_locks_per_xact = 64, track_commit_timestamp = false, maxAlign = 8,
  floatFormat = 1234567, blcksz = 8192, relseg_size = 131072, xlog_blcksz = 
8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
  toast_max_chunk_size = 1996, loblksize = 2048, float8ByVal = true, 
data_checksum_version = 1,
  mock_authentication_nonce = "*\307\177t\215\362\344 
\326\307I\374\005f7v@\242ə\265\230\273#+\301\t\212\204\377\004A", crc = 
4294967295}

Disassembling at the breakpoint:

(gdb) disassemble
Dump of assembler code for function BootStrapXLOG:
   0x0000aaaaaac21708 <+0>:     stp     x29, x30, [sp, #-272]!
   0x0000aaaaaac2170c <+4>:     mov     w1, #0x0                        // #0
   0x0000aaaaaac21710 <+8>:     mov     x29, sp
...
=> 0x0000aaaaaac219bc <+692>:   add     x19, sp, #0x90
   0x0000aaaaaac219c0 <+696>:   mov     x0, x19
   0x0000aaaaaac219c4 <+700>:   mov     x1, #0x20                       // #32
   0x0000aaaaaac219c8 <+704>:   str     w2, [x21, #28]
   0x0000aaaaaac219cc <+708>:   bl      0xaaaaab0ac824 <pg_strong_random>
   0x0000aaaaaac219d0 <+712>:   tbz     w0, #0, 0xaaaaaac21b28 
<BootStrapXLOG+1056>
   0x0000aaaaaac219d4 <+716>:   ldr     x3, [x22, #32]
   0x0000aaaaaac219d8 <+720>:   mov     x2, #0x128                      // #296
   0x0000aaaaaac219dc <+724>:   mov     w1, #0x0                        // #0
   0x0000aaaaaac219e0 <+728>:   mov     x0, x3
   0x0000aaaaaac219e4 <+732>:   bl      0xaaaaaab7f3b0 <memset@plt>
   0x0000aaaaaac219e8 <+736>:   mov     x3, x0
   0x0000aaaaaac219ec <+740>:   mov     x1, #0x3e8                      // #1000
   0x0000aaaaaac219f0 <+744>:   ldr     w9, [x21, #32]
   0x0000aaaaaac219f4 <+748>:   adrp    x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
   0x0000aaaaaac219f8 <+752>:   ldr     x7, [x7, #3720]
   0x0000aaaaaac219fc <+756>:   str     x1, [x3, #128]
   0x0000aaaaaac21a00 <+760>:   ldr     w1, [sp, #120]
   0x0000aaaaaac21a04 <+764>:   add     x0, x0, #0x28
   0x0000aaaaaac21a08 <+768>:   str     x23, [x3]
   0x0000aaaaaac21a0c <+772>:   str     w1, [x3, #252]
   0x0000aaaaaac21a10 <+776>:   adrp    x6, 0xaaaaab3cf000
   0x0000aaaaaac21a14 <+780>:   ldr     x6, [x6, #2392]
   0x0000aaaaaac21a18 <+784>:   adrp    x5, 0xaaaaab3cf000
   0x0000aaaaaac21a1c <+788>:   ldr     x5, [x5, #2960]
   0x0000aaaaaac21a20 <+792>:   adrp    x4, 0xaaaaab3cf000
   0x0000aaaaaac21a24 <+796>:   ldr     x4, [x4, #3352]
   0x0000aaaaaac21a28 <+800>:   ldp     q26, q25, [x19]
   0x0000aaaaaac21a2c <+804>:   str     s15, [x3, #16]
   0x0000aaaaaac21a30 <+808>:   adrp    x2, 0xaaaaab3cf000
   0x0000aaaaaac21a34 <+812>:   ldr     x2, [x2, #2552]
   0x0000aaaaaac21a38 <+816>:   str     x26, [x3, #24]
   0x0000aaaaaac21a3c <+820>:   adrp    x1, 0xaaaaab3ce000 <fmgr_builtins+72112>
   0x0000aaaaaac21a40 <+824>:   ldr     x1, [x1, #3320]
   0x0000aaaaaac21a44 <+828>:   ldp     q27, q29, [x20]
   0x0000aaaaaac21a48 <+832>:   str     x25, [x3, #32]
   0x0000aaaaaac21a4c <+836>:   str     w9, [x3, #172]
   0x0000aaaaaac21a50 <+840>:   ldr     w4, [x4]
   0x0000aaaaaac21a54 <+844>:   ldr     w5, [x5]
   0x0000aaaaaac21a58 <+848>:   ldr     w6, [x6]
   0x0000aaaaaac21a5c <+852>:   ldr     w7, [x7]
   0x0000aaaaaac21a60 <+856>:   ldr     q30, [x20, #64]
   0x0000aaaaaac21a64 <+860>:   ldp     q28, q31, [x20, #32]
   0x0000aaaaaac21a68 <+864>:   stur    q27, [x3, #40]
   0x0000aaaaaac21a6c <+868>:   ldrb    w8, [x22, #240]
   0x0000aaaaaac21a70 <+872>:   ldr     w2, [x2]
   0x0000aaaaaac21a74 <+876>:   ldrb    w1, [x1]
   0x0000aaaaaac21a78 <+880>:   stp     w7, w6, [x3, #180]
   0x0000aaaaaac21a7c <+884>:   stp     w5, w4, [x3, #188]
   0x0000aaaaaac21a80 <+888>:   strb    w1, [x3, #200]
   0x0000aaaaaac21a84 <+892>:   ldr     x1, [x20, #80]
   0x0000aaaaaac21a88 <+896>:   str     x1, [x3, #120]
   0x0000aaaaaac21a8c <+900>:   strb    w8, [x3, #176]
   0x0000aaaaaac21a90 <+904>:   str     w2, [x3, #196]
   0x0000aaaaaac21a94 <+908>:   stp     q26, q25, [x3, #256]
   0x0000aaaaaac21a98 <+912>:   stp     q29, q28, [x0, #16]
   0x0000aaaaaac21a9c <+916>:   stp     q31, q30, [x0, #48]
   0x0000aaaaaac21aa0 <+920>:   bl      0xaaaaaac1dcf0 <WriteControlFile>
...
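
One way to read the listing above is to map the store offsets back to
ControlFileData fields. The sketch below uses a hypothetical mirror of
the leading fields, in the order gdb printed them (the Mock names are
not PostgreSQL's), and assumes a typical 64-bit ABI with a 4-byte enum.
Under those assumptions state sits at offset 16, and the only store to
[x3, #16] in the portion shown is "str s15, [x3, #16]", so the value
seems to come from a SIMD register rather than from an immediate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the first ControlFileData fields, in the order
 * shown by "p *ControlFile" above. Not the real PostgreSQL header. */
typedef enum { MOCK_DB_STARTUP = 0, MOCK_DB_SHUTDOWNED = 1 } MockDBState;

typedef struct
{
    uint64_t    system_identifier;   /* str x23, [x3]       */
    uint32_t    pg_control_version;
    uint32_t    catalog_version_no;
    MockDBState state;               /* str s15, [x3, #16]? */
    int64_t     time;                /* str x26, [x3, #24]  */
    uint64_t    checkPoint;          /* str x25, [x3, #32]  */
} MockControlFileHead;

static size_t mock_state_offset(void)
{
    return offsetof(MockControlFileHead, state);
}

static size_t mock_time_offset(void)
{
    return offsetof(MockControlFileHead, time);
}

static size_t mock_checkpoint_offset(void)
{
    return offsetof(MockControlFileHead, checkPoint);
}

/* Prints the offsets so they can be compared with the disassembly. */
static void mock_print_offsets(void)
{
    /* Assumption: LP64-style layout, as on arm64 Linux. */
    assert(sizeof(MockDBState) == 4);
    printf("state=%zu time=%zu checkPoint=%zu\n",
           mock_state_offset(), mock_time_offset(),
           mock_checkpoint_offset());
}
```

Calling mock_print_offsets() from a main() lets you compare the numbers
with the stores in the listing. If this reading is right, it may be
worth checking what s15 holds after the pg_strong_random() call: s15 is
the low 32 bits of v15, and AAPCS64 requires callees to preserve the low
64 bits of v8-v15. That is only a guess at where to look, not a
diagnosis.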


The really weird thing is that the very same binaries work on a
different host (arm64 VM provided by Huawei) - the
postgresql_arm64.deb files compiled there and present on
apt.postgresql.org are fine, but when installed on that graviton VM,
they throw the above error.

It smells like Graviton's armv9 isn't as backwards compatible with armv8
as it should be. (But then I don't understand why disabling openssl
fixes it.)

Christoph

