On Mon, Jan 13, 2025 at 09:39:40PM +0100, Matthias van de Meent wrote:
> On Mon, 13 Jan 2025 at 20:04, Christoph Berg <c...@df7cb.de> wrote:
> >
> > Bernd and I have been chasing a bug that happens when all of the
> > following conditions are fulfilled:
> >
> > * PG 15..18 (older PGs are ok)
> > * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> > * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64
> >   host)
> > * -O2 (ok with -O0)
> > * --with-openssl (ok without openssl)
> > * using no -march flag, or using -march=armv8.4-a (using `-march=armv9-a`
> >   fixes it)
> >
> > The problem happens early during initdb:
> >
> > $ ./configure --with-openssl --enable-debug
> > ...
> > $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> > ...
> > running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL: control file contains invalid database cluster state
> > child process exited with exit code 1
> > initdb: data directory "broken" not removed at user's request
>
> Yes, weird.
>
> > (gdb) disassemble
> > Dump of assembler code for function BootStrapXLOG:
> >    0x0000aaaaaac21708 <+0>:   stp x29, x30, [sp, #-272]!
> >    0x0000aaaaaac2170c <+4>:   mov w1, #0x0 // #0
> >    0x0000aaaaaac21710 <+8>:   mov x29, sp
> > ...
> > => 0x0000aaaaaac219bc <+692>: add x19, sp, #0x90
> >    0x0000aaaaaac219c0 <+696>: mov x0, x19
> >    0x0000aaaaaac219c4 <+700>: mov x1, #0x20 // #32
> >    0x0000aaaaaac219c8 <+704>: str w2, [x21, #28]
> >    0x0000aaaaaac219cc <+708>: bl 0xaaaaab0ac824 <pg_strong_random>
>
> pg_strong_random pulls random values from OpenSSL's RAND_bytes
> (defined in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
> support. If OpenSSL isn't enabled we instead use /dev/urandom (on
> unix-y systems), which means different code will be generated for
> pg_strong_random.
>
> >    0x0000aaaaaac219d0 <+712>: tbz w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
> >    0x0000aaaaaac219d4 <+716>: ldr x3, [x22, #32]
> >    0x0000aaaaaac219d8 <+720>: mov x2, #0x128 // #296
> >    0x0000aaaaaac219dc <+724>: mov w1, #0x0 // #0
> >    0x0000aaaaaac219e0 <+728>: mov x0, x3
> >    0x0000aaaaaac219e4 <+732>: bl 0xaaaaaab7f3b0 <memset@plt>
>
> Given this code, it looks like register x3 contains ControlFile - it's
> being memset(..., 0, sizeof(ControlFileData));
>
> >    0x0000aaaaaac219e8 <+736>: mov x3, x0
> >    0x0000aaaaaac219ec <+740>: mov x1, #0x3e8 // #1000
> >    0x0000aaaaaac219f0 <+744>: ldr w9, [x21, #32]
> >    0x0000aaaaaac219f4 <+748>: adrp x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
> >    0x0000aaaaaac219f8 <+752>: ldr x7, [x7, #3720]
> >    0x0000aaaaaac219fc <+756>: str x1, [x3, #128]
>
> ... Which would make this the assignment to unloggedLSN (which matches
> the FirstNormalUnloggedLSN=1000 stored just above)
>
> >    0x0000aaaaaac21a00 <+760>: ldr w1, [sp, #120]
> >    0x0000aaaaaac21a04 <+764>: add x0, x0, #0x28
> >    0x0000aaaaaac21a08 <+768>: str x23, [x3]
>
> And this would be the assignment of systemidentifier,
>
> >    0x0000aaaaaac21a0c <+772>: str w1, [x3, #252]
>
> ... data_checksum_version,
>
> >    0x0000aaaaaac21a10 <+776>: adrp x6, 0xaaaaab3cf000
> >    0x0000aaaaaac21a14 <+780>: ldr x6, [x6, #2392]
> >    0x0000aaaaaac21a18 <+784>: adrp x5, 0xaaaaab3cf000
> >    0x0000aaaaaac21a1c <+788>: ldr x5, [x5, #2960]
> >    0x0000aaaaaac21a20 <+792>: adrp x4, 0xaaaaab3cf000
> >    0x0000aaaaaac21a24 <+796>: ldr x4, [x4, #3352]
> >    0x0000aaaaaac21a28 <+800>: ldp q26, q25, [x19]
> >    0x0000aaaaaac21a2c <+804>: str s15, [x3, #16]
>
> ... and finally ControlFile->state.
>
> I don't see where s15 is initialized and/or written to first, but this
> is the only reference in this section of ASM. As such, I think the
> initialization (presumably, "mov s15, #1" or such) must have happened
> before the call to pg_strong_random/RAND_bytes.
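
To make the offset reasoning above easier to double-check, here is a minimal C sketch. It is not the real ControlFileData from pg_control.h: MockControlFile and field_at_offset are hypothetical names, only the four fields named in the analysis are kept, and everything in between is opaque padding sized so that the offsets match the str instructions and the memset length (0x128) in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock layout, NOT the real ControlFileData: the padding arrays stand in
 * for all the fields this thread does not talk about. */
typedef struct MockControlFile
{
    uint64_t system_identifier;     /* str x23, [x3]       -> offset 0   */
    uint8_t  pad0[8];
    int32_t  state;                 /* str s15, [x3, #16]  -> offset 16  */
    uint8_t  pad1[108];
    uint64_t unloggedLSN;           /* str x1,  [x3, #128] -> offset 128 */
    uint8_t  pad2[116];
    uint32_t data_checksum_version; /* str w1,  [x3, #252] -> offset 252 */
    uint8_t  pad3[40];
} MockControlFile;

/* "mov x2, #0x128" before memset: the whole struct is 296 bytes. */
static_assert(sizeof(MockControlFile) == 0x128, "size must match memset length");
static_assert(offsetof(MockControlFile, state) == 16, "state offset");

/* Map a store offset seen in the disassembly to a field name.
 * Returns NULL for offsets that fall into the opaque padding. */
const char *field_at_offset(size_t off)
{
    if (off == offsetof(MockControlFile, system_identifier))
        return "system_identifier";
    if (off == offsetof(MockControlFile, state))
        return "state";
    if (off == offsetof(MockControlFile, unloggedLSN))
        return "unloggedLSN";
    if (off == offsetof(MockControlFile, data_checksum_version))
        return "data_checksum_version";
    return NULL;
}
```

With this, field_at_offset(16) names the field written by "str s15, [x3, #16]", which is the store that ends up with the wrong value. Against a real build, the same offsets can be checked with gdb's ptype /o on the actual struct.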
>
> Looking around on the internet, it seems that in the ARM Procedure
> Call Standard register s15 does not need to be preserved, and thus
> could be clobbered when we're going into pg_strong_random and co. If
> the register was indeed clobbered by OpenSSL, that would be a good
> explanation for these issues. Can you check this?
>
> > The really weird thing is that the very same binaries work on a
> > different host (arm64 VM provided by Huawei) - the
> > postgresql_arm64.deb files compiled there and present on
> > apt.postgresql.org are fine, but when installed on that graviton VM,
> > they throw the above error.
>
> If I were you, I'd start looking into the differences in behaviour of
> OpenSSL between the two ARM-based systems you mention; particularly
> with a focus on register contents. It looks like gdb's `i r ...`
> command could help out with that - or so StackOverflow tells me.
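
One note on the calling convention, as far as I can tell from the AAPCS64 document: v8-v15 are callee-saved, but only their low 64 bits, while v0-v7 and v16-v31 may be clobbered by a call. Since s15 is the low 32 bits of v15, the compiler is entitled to keep a value there across the call, and a callee that does not restore it is the one at fault. A minimal sketch of that rule follows; the function names are made up for illustration and the rule is encoded as I read it, so please check it against the spec.

```c
#include <assert.h>
#include <stdbool.h>

/* AAPCS64, as I read it: the low 64 bits of v8..v15 must be preserved by
 * the callee; v0..v7 and v16..v31 are caller-saved. */
bool simd_low64_callee_saved(int vreg)
{
    return vreg >= 8 && vreg <= 15;
}

/* s<N> is the low 32 bits of v<N>, so it lies inside the preserved low
 * 64 bits whenever v<N> is one of the callee-saved registers. */
bool s_reg_survives_call(int sreg)
{
    return simd_low64_callee_saved(sreg);
}

/* The case from this thread: a value parked in s15 before the call to
 * pg_strong_random should still be there afterwards. */
void check_s15(void)
{
    assert(s_reg_survives_call(15));
}
```

So a correct callee must leave s15 intact, which is consistent with the value going wrong only when the OpenSSL code path runs.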

This was all very helpful, and if I had paid more attention I'd have
seen it sooner, but here we go:

https://github.com/openssl/openssl/pull/26469

I believe this should fix your issue as well; I was debugging it from
the APT side for the past 14 hours or so.

The AES-CTR code is used by the default random number generator to
derive random numbers from the initial seed.

-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en