On Mon, 13 Jan 2025 at 20:04, Christoph Berg <c...@df7cb.de> wrote:
>
> Bernd and I have been chasing a bug that happens when all of the
> following conditions are fulfilled:
>
> * PG 15..18 (older PGs are ok)
> * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> * -O2 (ok with -O0)
> * --with-openssl (ok without openssl)
> * using no -m flag, or using `-march=armv8.4-a` (using `-march=armv9-a` fixes it)
>
> The problem happens early during initdb:
>
> $ ./configure --with-openssl --enable-debug
> ...
> $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> ...
> running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  control file contains invalid database cluster state
> child process exited with exit code 1
> initdb: data directory "broken" not removed at user's request

Yes, weird.

> (gdb) disassemble
> Dump of assembler code for function BootStrapXLOG:
>    0x0000aaaaaac21708 <+0>:     stp     x29, x30, [sp, #-272]!
>    0x0000aaaaaac2170c <+4>:     mov     w1, #0x0                        // #0
>    0x0000aaaaaac21710 <+8>:     mov     x29, sp
> ...
> => 0x0000aaaaaac219bc <+692>:   add     x19, sp, #0x90
>    0x0000aaaaaac219c0 <+696>:   mov     x0, x19
>    0x0000aaaaaac219c4 <+700>:   mov     x1, #0x20                       // #32
>    0x0000aaaaaac219c8 <+704>:   str     w2, [x21, #28]
>    0x0000aaaaaac219cc <+708>:   bl      0xaaaaab0ac824 <pg_strong_random>

pg_strong_random pulls random values from OpenSSL's RAND_bytes
(declared in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
support. If OpenSSL isn't enabled we instead read from /dev/urandom (on
Unix-like systems), which means different code will be generated for
pg_strong_random.
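
To illustrate why the two builds end up with different code behind the
same call, here is a minimal sketch of the two paths. This is not
PostgreSQL's actual implementation: the function name
sketch_strong_random is made up for this example, and USE_OPENSSL is
only used here as the build-time switch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#ifdef USE_OPENSSL
#include <openssl/rand.h>
#endif

/*
 * Simplified illustration only; NOT the real pg_strong_random().
 * The name sketch_strong_random is hypothetical.
 */
static bool
sketch_strong_random(void *buf, size_t len)
{
#ifdef USE_OPENSSL
    /* OpenSSL path: RAND_bytes() returns 1 on success. */
    return RAND_bytes((unsigned char *) buf, (int) len) == 1;
#else
    /* Fallback path: read the requested bytes from /dev/urandom. */
    FILE   *f = fopen("/dev/urandom", "rb");
    size_t  got;

    if (f == NULL)
        return false;
    got = fread(buf, 1, len, f);
    fclose(f);
    return got == len;
#endif
}
```

Only the first branch ever runs OpenSSL's (partly hand-written
assembly) code, which is why the failure can depend on --with-openssl.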

>    0x0000aaaaaac219d0 <+712>:   tbz     w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
>    0x0000aaaaaac219d4 <+716>:   ldr     x3, [x22, #32]
>    0x0000aaaaaac219d8 <+720>:   mov     x2, #0x128                      // #296
>    0x0000aaaaaac219dc <+724>:   mov     w1, #0x0                        // #0
>    0x0000aaaaaac219e0 <+728>:   mov     x0, x3
>    0x0000aaaaaac219e4 <+732>:   bl      0xaaaaaab7f3b0 <memset@plt>

Given this code, it looks like register x3 contains ControlFile: it is
being zeroed with memset(ControlFile, 0, sizeof(ControlFileData)), the
size argument being 0x128 (296) bytes.

>    0x0000aaaaaac219e8 <+736>:   mov     x3, x0
>    0x0000aaaaaac219ec <+740>:   mov     x1, #0x3e8                      // #1000
>    0x0000aaaaaac219f0 <+744>:   ldr     w9, [x21, #32]
>    0x0000aaaaaac219f4 <+748>:   adrp    x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
>    0x0000aaaaaac219f8 <+752>:   ldr     x7, [x7, #3720]
>    0x0000aaaaaac219fc <+756>:   str     x1, [x3, #128]

... Which would make this the assignment to unloggedLSN (which matches
the FirstNormalUnloggedLSN=1000 loaded into x1 just above)

>    0x0000aaaaaac21a00 <+760>:   ldr     w1, [sp, #120]
>    0x0000aaaaaac21a04 <+764>:   add     x0, x0, #0x28
>    0x0000aaaaaac21a08 <+768>:   str     x23, [x3]

And this would be the assignment of systemidentifier,

>    0x0000aaaaaac21a0c <+772>:   str     w1, [x3, #252]

... data_checksum_version,

>    0x0000aaaaaac21a10 <+776>:   adrp    x6, 0xaaaaab3cf000
>    0x0000aaaaaac21a14 <+780>:   ldr     x6, [x6, #2392]
>    0x0000aaaaaac21a18 <+784>:   adrp    x5, 0xaaaaab3cf000
>    0x0000aaaaaac21a1c <+788>:   ldr     x5, [x5, #2960]
>    0x0000aaaaaac21a20 <+792>:   adrp    x4, 0xaaaaab3cf000
>    0x0000aaaaaac21a24 <+796>:   ldr     x4, [x4, #3352]
>    0x0000aaaaaac21a28 <+800>:   ldp     q26, q25, [x19]
>    0x0000aaaaaac21a2c <+804>:   str     s15, [x3, #16]

... and finally ControlFile->state.
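
The way I'm mapping store offsets to fields can be sketched with a cut
down mock of the start of the control file struct. The type and enum
names below are made up for the example; the field order is what I
understand the leading fields of ControlFileData to be, so treat it as
an assumption rather than a copy of pg_control.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock; only the first enum values matter here. */
typedef enum MockDBState
{
    MOCK_DB_STARTUP = 0,
    MOCK_DB_SHUTDOWNED
} MockDBState;

/* Hypothetical mock of the leading fields of the control file struct. */
typedef struct MockControlFileHead
{
    uint64_t    system_identifier;   /* "str x23, [x3]"      -> offset 0  */
    uint32_t    pg_control_version;  /*                         offset 8  */
    uint32_t    catalog_version_no;  /*                         offset 12 */
    MockDBState state;               /* "str s15, [x3, #16]" -> offset 16 */
} MockControlFileHead;

/* Compile-time checks that the offsets match the stores in the dump. */
_Static_assert(offsetof(MockControlFileHead, system_identifier) == 0,
               "system_identifier is expected at offset 0");
_Static_assert(offsetof(MockControlFileHead, state) == 16,
               "state is expected at offset 16");
```

With that layout, the store to [x3, #16] can only be the 4-byte state
field, which is the field the FATAL message complains about.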

I don't see where s15 is initialized and/or written to first, but this
is the only reference in this section of ASM. As such, I think the
initialization (presumably, "mov s15, #1" or such) must have happened
before the call to pg_strong_random/RAND_bytes.

Looking around on the internet, it seems that in the 64-bit ARM
Procedure Call Standard (AAPCS64) the low 64 bits of v8-v15 are
callee-saved, so s15 (the low 32 bits of v15) should survive the call
into pg_strong_random and co. If the register was nevertheless
clobbered by OpenSSL, that would be a good explanation for these
issues. Can you check this?
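
One rough way to check whether a given call leaves d15 (and with it
s15) intact is a small harness like the one below. It is only a sketch:
the function names are made up, it assumes GCC-style inline assembly on
aarch64, and it relies on the compiler not using v15 itself between the
two asm statements. On other architectures it just reports
"unsupported".

```c
#include <stdint.h>

typedef void (*callee_fn) (void);

/* Hypothetical stand-in for the call under suspicion. */
static void
harmless_callee(void)
{
    volatile unsigned char buf[32];

    for (int i = 0; i < 32; i++)
        buf[i] = (unsigned char) i;
}

/*
 * Returns 1 if d15 held its value across fn(), 0 if it changed,
 * and -1 if this build cannot perform the check.
 */
static int
d15_survives_call(callee_fn fn)
{
#if defined(__aarch64__) && defined(__GNUC__)
    const uint64_t pattern = UINT64_C(0x1122334455667788);
    uint64_t    after = 0;

    /* Plant a known bit pattern in d15; tell the compiler v15 is touched. */
    __asm__ volatile("fmov d15, %0" : : "r"(pattern) : "v15", "memory");
    fn();
    /* Read d15 back after the call. */
    __asm__ volatile("fmov %0, d15" : "=r"(after) : : "memory");
    return after == pattern;
#else
    (void) fn;
    return -1;
#endif
}
```

Calling d15_survives_call(harmless_callee) gives a baseline; to test
the suspicion one would pass a small wrapper that calls RAND_bytes (and
link against OpenSSL), then compare the result on both hosts.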

> The really weird thing is that the very same binaries work on a
> different host (arm64 VM provided by Huawei) - the
> postgresql_arm64.deb files compiled there and present on
> apt.postgresql.org are fine, but when installed on that graviton VM,
> they throw the above error.

If I were you, I'd start looking into the differences in behaviour of
OpenSSL between the two ARM-based systems you mention; particularly
with a focus on register contents. It looks like gdb's `i r ...`
command could help out with that - or so StackOverflow tells me.


Kind regards,

Matthias van de Meent

