Hi -hackers, While chasing some other bug I've learned that backtrace_functions might be misleading with top elog/ereport() address.
Reproducer: # using Tom's reproducer on master: wget https://www.postgresql.org/message-id/attachment/112394/ri-collation-bug-example.sql echo '' >> ri-collation-bug-example.sql echo '\errverbose' >> ri-collation-bug-example.sql -- based on grepping the source code locations many others could go in here: psql -p 5437 -c "alter system set backtrace_functions = 'RI_FKey_cascade_del,get_collation_isdeterministic';" psql -p 5437 -c "select pg_reload_conf();" dropdb -p 5437 test1 ; createdb -p 5437 test1 ; psql -p 5437 -d test1 -f ri-collation-bug-example.sql gives (note "get_collation_isdeterministic"): psql:ri-collation-bug-example.sql:42: ERROR: cache lookup failed for collation 0 psql:ri-collation-bug-example.sql:44: error: ERROR: XX000: cache lookup failed for collation 0 LOCATION: get_collation_isdeterministic, lsyscache.c:1088 and in log note the 0x14c6bb: 2024-03-27 14:39:13.065 CET [12899] postgres@test1 ERROR: cache lookup failed for collation 0 2024-03-27 14:39:13.065 CET [12899] postgres@test1 BACKTRACE: postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb) [0x5565c5a9c6bb] postgres: 16/main: postgres test1 [local] DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973] postgres: 16/main: postgres test1 [local] DELETE(+0x2d401f) [0x5565c5c2401f] postgres: 16/main: postgres test1 [local] DELETE(+0x2d5a04) [0x5565c5c25a04] postgres: 16/main: postgres test1 [local] DELETE(AfterTriggerEndQuery+0x78) [0x5565c5c2a918] [..] 2024-03-27 14:39:13.065 CET [12899] postgres@test1 STATEMENT: delete from revisions where id='5c617ce7-688d-4bea-9d66-c0f0ebc635da'; based on \errverbose it is OK (error matches the code, thanks to Alvaro for this hint): 9 bool 8 get_collation_isdeterministic(Oid colloid) 7 { 6 │ HeapTuple> tp; 5 │ Form_pg_collation colltup; 4 │ bool> > result; 3 │ 2 │ tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid)); 1 │ if (!HeapTupleIsValid(tp)) 1088 │ │ elog(ERROR, "cache lookup failed for collation %u", colloid); [..] but based on backtrace address 0x14c6bb (!) and it resolves differently which seems to be highly misleading (note the "get_language_name.cold"): $ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb 0x000000000014c6bb get_language_name.cold ./build/src/backend/utils/cache/./build/../src/backend/utils/cache/lsyscache.c:1181 When i disassemble the get_collation_isdeterministic() (and I know the name from \errverbose): Dump of assembler code for function get_collation_isdeterministic: Address range 0x5c7680 to 0x5c76c1: 0x00000000005c7680 <+0>: push %rbp [..] 0x00000000005c7694 <+20>: call 0x5d63b0 <SearchSysCache1> 0x00000000005c7699 <+25>: test %rax,%rax 0x00000000005c769c <+28>: je 0x14c686 <get_collation_isdeterministic.cold> 0x00000000005c76a2 <+34>: mov %rax,%rdi [..] 0x00000000005c76bf <+63>: leave 0x00000000005c76c0 <+64>: ret Address range 0x14c686 to 0x14c6bb: 0x000000000014c686 <-4698106>: xor %esi,%esi 0x000000000014c688 <-4698104>: mov $0x15,%edi 0x000000000014c68d <-4698099>: call 0x14ec86 <errstart_cold> 0x000000000014c692 <-4698094>: mov %r12d,%esi 0x000000000014c695 <-4698091>: lea 0x5028dc(%rip),%rdi # 0x64ef78 0x000000000014c69c <-4698084>: xor %eax,%eax 0x000000000014c69e <-4698082>: call 0x5df320 <errmsg_internal> 0x000000000014c6a3 <-4698077>: lea 0x6311a6(%rip),%rdx # 0x77d850 <__func__.34> 0x000000000014c6aa <-4698070>: mov $0x440,%esi 0x000000000014c6af <-4698065>: lea 0x630c8a(%rip),%rdi # 0x77d340 0x000000000014c6b6 <-4698058>: call 0x5df100 <errfinish> << NOTE the exact 0x14c6bb is even missing here(!) notice the interesting thing here: according to GDB get_collation_isdeterministic() is @ 0x5c7680 + jump to 0x14c686 <get_collation_isdeterministic.cold> till 0x14c6bb (but without it) without any stack setup for that .cold. But backtrace() just captured the elog/ereport (cold) at the end of 0x14c6bb instead, so if I take that exact address from backtrace_functions output as it is ("DELETE(+0x14c6bb)"), it also shows WRONG function (just like addr2line): (gdb) disassemble 0x14c6bb Dump of assembler code for function get_language_name: Address range 0x5c7780 to 0x5c77ee: [..] 0x00000000005c77ed <+109>: ret Address range 0x14c6bb to 0x14c6f0: 0x000000000014c6bb <-4698309>: xor %esi,%esi [..] 0x000000000014c6eb <-4698261>: call 0x5df100 <errfinish> End of assembler dump. so this is wrong (as are failing in "get_collation_isdeterministic" not in "get_language_name"). I was very confused at the beginning with the main question being: why are we compiling elog/ereport() so that it is incompatible with backtrace? When looking for it I've found two threads [1] by David that were actually helpful in understanding that this was done for performance (unlikley(), cold attribute and similiar type of discussions). Now my thought is that for >= ERROR in ereport_domain we could add something more (?) before pg_unreachable() that would help the backtrace to resolve the symbol correctly If I try non-portable PoC with x86 nop instruction: --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -145,8 +145,10 @@ struct Node; errstart_cold(elevel, domain) : \ errstart(elevel, domain)) \ __VA_ARGS__, errfinish(__FILE__, __LINE__, __func__); \ - if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \ + if (__builtin_constant_p(elevel) && (elevel) >= ERROR) { \ + __asm__ __volatile__("nop"); \ pg_unreachable(); \ + } \ it partially works and the address can be now properly resolved! Somewhat backtrace() still is unable to lookup the to do so in log itself: postgres: postgres test1 [local] DELETE(+0x15f84c) [0x55703bf5284c] postgres: postgres test1 [local] DELETE(RI_FKey_cascade_del+0x2b0) [0x55703c34c850] but at least: postgres@hive:~$ addr2line -e /usr/pgsql17/bin/postgres -a -f 0x15f84c 0x000000000015f84c get_collation_isdeterministic /git/postgres/src/backend/utils/cache/lsyscache.c:1062 (discriminator 4) postgres@hive:~$ in gdb it will point right to our new nop: 0x000000000015f840 <-4490944>: lea 0x608306(%rip),%rdi # 0x767b4d 0x000000000015f847 <-4490937>: call 0x5be1d0 <errfinish> 0x000000000015f84c <-4490932>: nop Thoughts? Does it make sense to post a patch (for pg18?)? How to do it in $arch-independent way? I've tried also to try to find a way with e.g. -rdynamic to show that real function name, but it looks like without some more serious unwind library it seems unrealistic (?) to resolve that get_collation_isdeterministic.cold -J. [1] - https://www.postgresql.org/message-id/flat/CAApHDvoWoxtbWiqZxrhO%2Bi9NoG56AWHDzuDDD%2B1OEf4PxzFhig%40mail.gmail.com#611566bd6fd06f27e51dbc3148a673d3 [2] - https://www.postgresql.org/message-id/flat/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68%3D3awbh8CJWTP9zXeoHMw%40mail.gmail.com