Hi Tom and -hackers! On Thu, Mar 28, 2024 at 7:36 PM Tom Lane <[email protected]> wrote: > > Jakub Wartak <[email protected]> writes: > > While chasing some other bug I've learned that backtrace_functions > > might be misleading with top elog/ereport() address. > > That was understood from the beginning: this type of backtrace is > inherently pretty imprecise, and I doubt there is much that can > be done to make it better. IIRC the fundamental problem is it only > looks at global symbols, so static functions inherently defeat it. > It was argued that this is better than nothing, which is true, but > you have to take the info with a mountain of salt.
OK, point taken, thanks for describing the limitations, still I find
backtrace_functions often the best thing we have primarily due its
simplicity and ease of use (for customers). I found out simplest
portable way to generate NOP (or any other instruction that makes the
problem go away):
with the reproducer, psql returns:
psql:ri-collation-bug-example.sql:48: error: ERROR: XX000: cache
lookup failed for collation 0
LOCATION: get_collation_isdeterministic, lsyscache.c:1062
logfile without patch:
2024-05-07 09:05:43.043 CEST [44720] ERROR: cache lookup failed for collation 0
2024-05-07 09:05:43.043 CEST [44720] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x155883) [0x55ce5a86a883]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55ce5ac6dfd0]
postgres: postgres postgres [local] DELETE(+0x2d488b) [0x55ce5a9e988b]
[..]
$ addr2line -e /usr/pgsql18/bin/postgres -a -f 0x155883
0x0000000000155883
get_constraint_type.cold <<<<<< so it's wrong as the real function
should be get_collation_isdeterministic
logfile with attached patch:
2024-05-07 09:11:06.596 CEST [51185] ERROR: cache lookup failed for collation 0
2024-05-07 09:11:06.596 CEST [51185] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x168bf0) [0x560e1cdfabf0]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x560e1d200c00]
postgres: postgres postgres [local] DELETE(+0x2e90fb) [0x560e1cf7b0fb]
[..]
$ addr2line -e /usr/pgsql18/bin/postgres -a -f 0x168bf0
0x0000000000168bf0
get_collation_isdeterministic.cold <<< It's ok with the patch
NOTE: in case one will be testing this: one cannot ./configure with
--enable-debug as it prevents the compiler optimizations that actually
end up with the ".cold" branch optimizations that cause backtrace() to
return the wrong address.
> I recall speculating about whether we could somehow invoke gdb
> to get a more comprehensive and accurate backtrace, but I don't
> really have a concrete idea how that could be made to work.
Oh no, I'm more about micro-fix rather than doing some bigger
revolution. The patch simply adds this one instruction in macro, so
that now backtrace_functions behaves better:
0x0000000000773d28 <+105>: call 0x79243f <errfinish>
0x0000000000773d2d <+110>: movzbl -0x12(%rbp),%eax << this ends
up being added by the patch
0x0000000000773d31 <+114>: call 0xdc1a0 <abort@plt>
I'll put that as for PG18 in CF, but maybe it could be backpatched too
- no hard opinion on that (I don't see how that ERROR/FATAL path could
cause any performance regressions)
-J.
v1-0001-Tweak-ereport-so-that-it-generates-proper-address.patch
Description: Binary data
