On Mon, 13 Jan 2025 at 20:04, Christoph Berg <c...@df7cb.de> wrote:
>
> Bernd and I have been chasing a bug that happens when all of the
> following conditions are fulfilled:
>
> * PG 15..18 (older PGs are ok)
> * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> * -O2 (ok with -O0)
> * --with-openssl (ok without openssl)
> * using no -m flag, or using -marm8.4-a (using `-march=armv9-a` fixes it)
>
> The problem happens early during initdb:
>
> $ ./configure --with-openssl --enable-debug
> ...
> $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> ...
> running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:
> control file contains invalid database cluster state
> child process exited with exit code 1
> initdb: data directory "broken" not removed at user's request

Yes, weird.

> (gdb) disassemble
> Dump of assembler code for function BootStrapXLOG:
>    0x0000aaaaaac21708 <+0>:    stp  x29, x30, [sp, #-272]!
>    0x0000aaaaaac2170c <+4>:    mov  w1, #0x0    // #0
>    0x0000aaaaaac21710 <+8>:    mov  x29, sp
> ...
> => 0x0000aaaaaac219bc <+692>:  add  x19, sp, #0x90
>    0x0000aaaaaac219c0 <+696>:  mov  x0, x19
>    0x0000aaaaaac219c4 <+700>:  mov  x1, #0x20    // #32
>    0x0000aaaaaac219c8 <+704>:  str  w2, [x21, #28]
>    0x0000aaaaaac219cc <+708>:  bl   0xaaaaab0ac824 <pg_strong_random>

pg_strong_random pulls random values from OpenSSL's RAND_bytes
(declared in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
support. If OpenSSL isn't enabled we instead use /dev/urandom (on
unix-y systems), which means different code will be generated for
pg_strong_random.

>    0x0000aaaaaac219d0 <+712>:  tbz  w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
>    0x0000aaaaaac219d4 <+716>:  ldr  x3, [x22, #32]
>    0x0000aaaaaac219d8 <+720>:  mov  x2, #0x128    // #296
>    0x0000aaaaaac219dc <+724>:  mov  w1, #0x0    // #0
>    0x0000aaaaaac219e0 <+728>:  mov  x0, x3
>    0x0000aaaaaac219e4 <+732>:  bl   0xaaaaaab7f3b0 <memset@plt>

Given this code, it looks like register x3 contains ControlFile - it's
being memset(..., 0, sizeof(ControlFileData));

>    0x0000aaaaaac219e8 <+736>:  mov  x3, x0
>    0x0000aaaaaac219ec <+740>:  mov  x1, #0x3e8    // #1000
>    0x0000aaaaaac219f0 <+744>:  ldr  w9, [x21, #32]
>    0x0000aaaaaac219f4 <+748>:  adrp x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
>    0x0000aaaaaac219f8 <+752>:  ldr  x7, [x7, #3720]
>    0x0000aaaaaac219fc <+756>:  str  x1, [x3, #128]

... Which would make this the assignment to unloggedLSN (which matches
the FirstNormalUnloggedLSN=1000 stored just above)

>    0x0000aaaaaac21a00 <+760>:  ldr  w1, [sp, #120]
>    0x0000aaaaaac21a04 <+764>:  add  x0, x0, #0x28
>    0x0000aaaaaac21a08 <+768>:  str  x23, [x3]

And this would be the assignment of system_identifier,

>    0x0000aaaaaac21a0c <+772>:  str  w1, [x3, #252]

...
data_checksum_version,

>    0x0000aaaaaac21a10 <+776>:  adrp x6, 0xaaaaab3cf000
>    0x0000aaaaaac21a14 <+780>:  ldr  x6, [x6, #2392]
>    0x0000aaaaaac21a18 <+784>:  adrp x5, 0xaaaaab3cf000
>    0x0000aaaaaac21a1c <+788>:  ldr  x5, [x5, #2960]
>    0x0000aaaaaac21a20 <+792>:  adrp x4, 0xaaaaab3cf000
>    0x0000aaaaaac21a24 <+796>:  ldr  x4, [x4, #3352]
>    0x0000aaaaaac21a28 <+800>:  ldp  q26, q25, [x19]
>    0x0000aaaaaac21a2c <+804>:  str  s15, [x3, #16]

... and finally ControlFile->state.

I don't see where s15 is initialized and/or written to first, but this
is the only reference in this section of ASM. As such, I think the
initialization (presumably, "mov s15, #1" or such) must have happened
before the call to pg_strong_random/RAND_bytes.

Looking around on the internet, it seems that under the AArch64
procedure call standard (AAPCS64) the low 64 bits of v8-v15 are
callee-saved. s15 is the low 32 bits of v15, so gcc may keep a value
there across the call and expects whatever we reach through
pg_strong_random and co. to restore it before returning. If the
register was indeed clobbered by OpenSSL without being restored, that
would be a good explanation for these issues. Can you check this?

> The really weird thing is that the very same binaries work on a
> different host (arm64 VM provided by Huawei) - the
> postgresql_arm64.deb files compiled there and present on
> apt.postgresql.org are fine, but when installed on that graviton VM,
> they throw the above error.

If I were you, I'd start looking into the differences in behaviour of
OpenSSL between the two ARM-based systems you mention, particularly
with a focus on register contents. It looks like gdb's `i r ...`
command could help out with that - or so StackOverflow tells me.

Kind regards,

Matthias van de Meent