On Mon, Jan 13, 2025 at 09:39:40PM +0100, Matthias van de Meent wrote:
> On Mon, 13 Jan 2025 at 20:04, Christoph Berg <c...@df7cb.de> wrote:
> >
> > Bernd and I have been chasing a bug that happens when all of the
> > following conditions are fulfilled:
> >
> > * PG 15..18 (older PGs are ok)
> > * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> > * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> > * -O2 (ok with -O0)
> > * --with-openssl (ok without openssl)
> > * using no -march flag, or using `-march=armv8.4-a` (using `-march=armv9-a` fixes it)
> >
> > The problem happens early during initdb:
> >
> > $ ./configure --with-openssl --enable-debug
> > ...
> > $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> > ...
> > running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  control file contains invalid database cluster state
> > child process exited with exit code 1
> > initdb: data directory "broken" not removed at user's request
> 
> Yes, weird.
> 
> > (gdb) disassemble
> > Dump of assembler code for function BootStrapXLOG:
> >    0x0000aaaaaac21708 <+0>:     stp     x29, x30, [sp, #-272]!
> >    0x0000aaaaaac2170c <+4>:     mov     w1, #0x0                        // #0
> >    0x0000aaaaaac21710 <+8>:     mov     x29, sp
> > ...
> > => 0x0000aaaaaac219bc <+692>:   add     x19, sp, #0x90
> >    0x0000aaaaaac219c0 <+696>:   mov     x0, x19
> >    0x0000aaaaaac219c4 <+700>:   mov     x1, #0x20                       // #32
> >    0x0000aaaaaac219c8 <+704>:   str     w2, [x21, #28]
> >    0x0000aaaaaac219cc <+708>:   bl      0xaaaaab0ac824 <pg_strong_random>
> 
> pg_strong_random pulls random values from OpenSSL's RAND_bytes
> (declared in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
> support. If OpenSSL isn't enabled we instead use /dev/urandom (on
> unix-y systems), which means different code will be generated for
> pg_strong_random.
> 
> >    0x0000aaaaaac219d0 <+712>:   tbz     w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
> >    0x0000aaaaaac219d4 <+716>:   ldr     x3, [x22, #32]
> >    0x0000aaaaaac219d8 <+720>:   mov     x2, #0x128                      // #296
> >    0x0000aaaaaac219dc <+724>:   mov     w1, #0x0                        // #0
> >    0x0000aaaaaac219e0 <+728>:   mov     x0, x3
> >    0x0000aaaaaac219e4 <+732>:   bl      0xaaaaaab7f3b0 <memset@plt>
> 
> Given this code, it looks like register x3 contains ControlFile - it's
> being memset(..., 0, sizeof(ControlFileData));
> 
> >    0x0000aaaaaac219e8 <+736>:   mov     x3, x0
> >    0x0000aaaaaac219ec <+740>:   mov     x1, #0x3e8                      // #1000
> >    0x0000aaaaaac219f0 <+744>:   ldr     w9, [x21, #32]
> >    0x0000aaaaaac219f4 <+748>:   adrp    x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
> >    0x0000aaaaaac219f8 <+752>:   ldr     x7, [x7, #3720]
> >    0x0000aaaaaac219fc <+756>:   str     x1, [x3, #128]
> 
> ... Which would make this the assignment to unloggedLSN (which matches
> the FirstNormalUnloggedLSN=1000 stored just above)
> 
> >    0x0000aaaaaac21a00 <+760>:   ldr     w1, [sp, #120]
> >    0x0000aaaaaac21a04 <+764>:   add     x0, x0, #0x28
> >    0x0000aaaaaac21a08 <+768>:   str     x23, [x3]
> 
> And this would be the assignment of systemidentifier,
> 
> >    0x0000aaaaaac21a0c <+772>:   str     w1, [x3, #252]
> 
> ... data_checksum_version,
> 
> >    0x0000aaaaaac21a10 <+776>:   adrp    x6, 0xaaaaab3cf000
> >    0x0000aaaaaac21a14 <+780>:   ldr     x6, [x6, #2392]
> >    0x0000aaaaaac21a18 <+784>:   adrp    x5, 0xaaaaab3cf000
> >    0x0000aaaaaac21a1c <+788>:   ldr     x5, [x5, #2960]
> >    0x0000aaaaaac21a20 <+792>:   adrp    x4, 0xaaaaab3cf000
> >    0x0000aaaaaac21a24 <+796>:   ldr     x4, [x4, #3352]
> >    0x0000aaaaaac21a28 <+800>:   ldp     q26, q25, [x19]
> >    0x0000aaaaaac21a2c <+804>:   str     s15, [x3, #16]
> 
> ... and finally ControlFile->state.
> 
> I don't see where s15 is initialized and/or written to first, but this
> is the only reference in this section of ASM. As such, I think the
> initialization (presumably, "mov s15, #1" or such) must have happened
> before the call to pg_strong_random/RAND_bytes.
> 
> Looking around on the internet, it seems that in the ARM Procedure
> Call Standard register s15 does not need to be preserved, and thus
> could be clobbered when we're going into pg_strong_random and co. If the
> register was indeed clobbered by OpenSSL, that would be a good
> explanation for these issues. Can you check this?
> 
> > The really weird thing is that the very same binaries work on a
> > different host (arm64 VM provided by Huawei) - the
> > postgresql_arm64.deb files compiled there and present on
> > apt.postgresql.org are fine, but when installed on that graviton VM,
> > they throw the above error.
> 
> If I were you, I'd start looking into the differences in behaviour of
> OpenSSL between the two ARM-based systems you mention; particularly
> with a focus on register contents. It looks like gdb's `i r ...`
> command could help out with that - or so StackOverflow tells me.

This was all very helpful, and if I had paid more attention I'd have
seen it sooner, but here we go:

https://github.com/openssl/openssl/pull/26469

I believe this should fix your issue as well; I have been debugging it
from the APT side for the past 14 hours or so.

The AES-CTR code is used by the default random number generator
to derive random numbers from the initial seed.
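
To make the pattern from the disassembly concrete, here is a minimal C
sketch of it: a value is set up before the random-number call and only
stored into the control structure afterwards, so the compiler may keep
it in a register across the call. As far as I understand AAPCS64, the
low 64 bits of v8-v15 are callee-saved, so holding the value in s15 is
legitimate and relies on the callee restoring it. Everything named
*_sketch is hypothetical and not PostgreSQL or OpenSSL code; the random
source is the /dev/urandom path rather than RAND_bytes, the struct is a
simplified stand-in for ControlFileData, and the state value 1 is taken
from the "mov s15, #1" guess above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the non-OpenSSL random path: read from
 * /dev/urandom. Returns true only if all len bytes were read. */
static bool strong_random_sketch(void *buf, size_t len)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return false;
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got == len;
}

/* Illustrative state value; assumed from the "mov s15, #1" guess. */
enum { STATE_SKETCH = 1 };

/* Simplified, hypothetical stand-in for the control structure. */
struct control_sketch {
    unsigned long long system_identifier;
    int state;
    unsigned char nonce[32];
};

/* Mirrors the order of operations seen in the disassembly:
 * 1. the state value is live before the call,
 * 2. 32 random bytes are requested,
 * 3. the structure is zeroed,
 * 4. the state value is stored.
 * Step 4 is only correct if the callee in step 2 preserved whatever
 * register the compiler chose to hold 'state' in. */
static int bootstrap_sketch(struct control_sketch *cf)
{
    int state = STATE_SKETCH;
    unsigned char nonce[32];

    assert(cf != NULL);
    if (!strong_random_sketch(nonce, sizeof nonce))
        return -1;

    memset(cf, 0, sizeof *cf);
    cf->state = state;
    memcpy(cf->nonce, nonce, sizeof nonce);
    return cf->state;
}
```

With a well-behaved callee, bootstrap_sketch() returns the state value
it started with; a callee that clobbers a callee-saved register without
restoring it can make that final store write garbage, which would match
the "invalid database cluster state" error.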
-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en

