https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123874
Bug ID: 123874
Summary: Incorrect interception of deprecated symbols causes
crash under sanitizers
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: sanitizer
Assignee: unassigned at gcc dot gnu.org
Reporter: kirelagin at gmail dot com
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org
Target Milestone: ---
(I am focusing on ASAN in the description but, I believe, all sanitizers are
affected the same way.)
(I think LLVM is affected as well. I have confirmed that clang 18.1 has the
same problem, although I did not check the code in their master.)
At the core is a fundamental limitation of how interception interacts with
loading objects via dlopen (e.g. plugins).
Here is a simplified reproducer:
```c
// main.c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
int main(void) {
    void *const handle = dlopen("./plugin.so", RTLD_LAZY);
    assert(handle);
    void *foo = dlsym(handle, "foo");
    assert(foo);
    /* foo is defined in plugin.c as a function returning void */
    ((void (*)(void))foo)();
    return 0;
}
```
```c
// plugin.c
#define _GNU_SOURCE
#include <crypt.h>
#include <stdio.h>
void foo(void) {
    struct crypt_data data;
    data.initialized = 0;
    crypt_r("hello", "world", &data);
    puts("OK");
}
```
```shell_session
$ gcc -lcrypt -fpic -shared plugin.c -o plugin.so
$ gcc main.c -o main
$ ./main
OK
$ gcc -fsanitize=address main.c -o main
$ ./main
AddressSanitizer:DEADLYSIGNAL
=================================================================
==2444535==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7fffffff36f0 sp 0x7fffffff2e88 T0)
==2444535==Hint: pc points to the zero page.
==2444535==The signal is caused by a READ memory access.
==2444535==Hint: address points to the zero page.
#0 0x0 (<unknown module>)
#1 0x7ffff6ee3167 in foo (plugin.so+0x1167)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
==2444535==ABORTING
```
Here is why this is happening:
1. On startup, libasan resolves the corresponding real functions for its
interceptors.
2. The `crypt_r` function has been removed from glibc, so `dlsym` returns
nullptr.
3. plugin.so is loaded using `dlopen`; among its dependencies is `libcrypt.so`,
which provides the symbol now.
4. The function in the plugin tries to call `crypt_r`, the call is intercepted,
and `__interceptor_crypt_r` crashes calling nullptr.
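For illustration, the four steps above can be modelled without any real dynamic loading. This is a toy sketch, not libasan code: `toy_lookup`, `toy_dlopen_plugin`, `late_fn` and the other names are hypothetical stand-ins for `dlsym`, `dlopen` and `crypt_r`, and the null check in the interceptor exists only so that the sketch reports the stale pointer instead of jumping to address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*real_fn)(void);

/* Hypothetical stand-in for the implementation that lives in a library
 * loaded later (libcrypt's crypt_r in the report). */
static int late_impl(void) { return 42; }

/* Toy model of the loader's view: the symbol becomes visible only after
 * the "plugin" and its dependencies have been loaded. */
static int plugin_loaded = 0;

real_fn toy_lookup(const char *name) {
    assert(name != NULL);
    if (plugin_loaded && strcmp(name, "late_fn") == 0)
        return late_impl;
    return NULL;
}

/* Cached "real" target, resolved exactly once. */
static real_fn real_late_fn;

/* Steps 1 and 2: eager resolution at startup finds nothing. */
void init_interceptors(void) { real_late_fn = toy_lookup("late_fn"); }

/* Step 3: the plugin is loaded; the cached pointer is not refreshed. */
void toy_dlopen_plugin(void) { plugin_loaded = 1; }

/* Step 4: the interceptor dispatches through the stale pointer.
 * Returns -1 where, per the report, the real interceptor calls nullptr. */
int interceptor_late_fn(void) {
    if (real_late_fn == NULL)
        return -1;
    return real_late_fn();
}
```

After `init_interceptors()` followed by `toy_dlopen_plugin()`, a fresh `toy_lookup("late_fn")` succeeds, yet `interceptor_late_fn()` still sees the null pointer cached at startup.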
This is GCC 13.3.0 from Nixpkgs.
This specific reproducer does not trigger the crash with GCC 14, because the
libcrypt interceptors were removed at some point, but the issue at hand remains,
as there are interceptors for other deprecated functions.
The reproducer above is, naturally, a bit contrived, for simplicity. Here is an
example of a real-world scenario, which actually led to my investigation.
The code in `example.c` is the example from the bottom of `man getpwnam_r` (not
included here for brevity).
```shell_session
$ gcc example.c -o example
$ ./example bad-user
Not found
$ gcc -fsanitize=address example.c -o example
$ ./example bad-user
AddressSanitizer:DEADLYSIGNAL
=================================================================
==3387167==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7fffffffaaa0 sp 0x7fffffffa258 T0)
==3387167==Hint: pc points to the zero page.
==3387167==The signal is caused by a READ memory access.
==3387167==Hint: address points to the zero page.
#0 0x0 (<unknown module>)
#1 0x7ffff6ec35c1 in __yp_bind.part.0 (/nix/store/<hash>-libnsl-2.0.1/lib/libnsl.so.3+0x35c1)
#2 0x7ffff6ec3a5d in do_ypcall (/nix/store/<hash>-libnsl-2.0.1/lib/libnsl.so.3+0x3a5d)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
==3387167==ABORTING
```
This is running on a RHEL 8 machine, with NSS configured to use NIS for passwd.
The sequence of steps leading to the issue is essentially the same as in the
first example, just the interceptor is different:
1. The executable starts, libasan resolves the real functions for its
interceptors.
2. The `xdrstdio_create` function has been deprecated, so `dlsym` returns
nullptr.
3. When trying to locate the user, NSS loads `libnss_nis.so` as a plugin.
4. `libnss_nis.so` depends on `libnsl`, which depends on `libtirpc`, which is
where the deprecated function was moved to.
5. `libnsl.so` calls `xdrstdio_create`, `__interceptor_xdrstdio_create` tries
to call nullptr.
Full backtrace:
```text
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff79783f9 in __interceptor_xdrstdio_create.part.0 () from /nix/store/<hash>-gcc-13.3.0-lib/lib/libasan.so.8
#2 0x00007ffff6ec35c2 in __yp_bind.part.0 () from /nix/store/<hash>-libnsl-2.0.1/lib/libnsl.so.3
#3 0x00007ffff6ec3a5e in do_ypcall () from /nix/store/<hash>-libnsl-2.0.1/lib/libnsl.so.3
#4 0x00007ffff6ec3c69 in do_ypcall_tr () from /nix/store/<hash>-libnsl-2.0.1/lib/libnsl.so.3
#5 0x00007ffff6ec4605 in yp_match () from /nix/store/<hash>-libnsl-2.0.1/lib/libnsl.so.3
#6 0x00007ffff6ed1d57 in _nss_nis_getpwnam_r () from /nix/store/<hash>-nss-libs/lib/libnss_nis.so.2
#7 0x00007ffff786bd23 in __getpwnam_r (name=<optimized out>, resbuf=0x7ffff5800050, buffer=<optimized out>, buflen=1024, result=<optimized out>) at ../nss/getXXbyYY_r.c:273
#8 0x00007ffff79dbb28 in getpwnam_r () from /nix/store/<hash>-gcc-13.3.0-lib/lib/libasan.so.8
#9 0x00000000004013f6 in main ()
```
An additional layer of complexity is that this crashes only with recent glibc
versions.
I did not bisect glibc, but my guess is that the change that makes the
difference is the fix of this bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=14932), which landed in glibc
2.36.
```shell_session
(glibc 2.28) $ ./dlsym
dlsym(RTLD_DEFAULT, xdrstdio_create) => (nil)
dlsym(RTLD_NEXT, xdrstdio_create) => 0x7ffff7965180
dlvsym(RTLD_DEFAULT, xdrstdio_create) => 0x7ffff7965180
dlvsym(RTLD_NEXT, xdrstdio_create) => 0x7ffff7965180
(glibc 2.40) $ ./dlsym
dlsym(RTLD_DEFAULT, xdrstdio_create) => (nil)
dlsym(RTLD_NEXT, xdrstdio_create) => (nil)
dlvsym(RTLD_DEFAULT, xdrstdio_create) => 0x7ffff7f1c330
dlvsym(RTLD_NEXT, xdrstdio_create) => 0x7ffff7f1c330
```
So, on RHEL 8 and 9, this issue is hidden by a glibc bug, which causes the
“real” function to be incorrectly resolved to the deprecated symbol in glibc,
instead of the actual implementation in the library that it has been moved to.
Arguably, this is even worse than crashing, because this quietly replaces the
implementation of the function with one which the caller does not expect.
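The `./dlsym` helper used for the comparison above is not included in the report. A minimal sketch of such a probe might look like the following; the function name `probe_symbol` is hypothetical, and the version string to pass for `xdrstdio_create` (for example `"GLIBC_2.2.5"`) is an assumption that applies to x86-64 and differs on other targets.

```c
#define _GNU_SOURCE /* RTLD_DEFAULT, RTLD_NEXT and dlvsym are glibc extensions */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Print what each lookup flavour returns for `name`.
 * `version` is the symbol version handed to dlvsym.
 * Returns how many of the four lookups yielded a non-null address. */
int probe_symbol(const char *name, const char *version) {
    assert(name != NULL && version != NULL);

    void *results[4] = {
        dlsym(RTLD_DEFAULT, name),
        dlsym(RTLD_NEXT, name),
        dlvsym(RTLD_DEFAULT, name, version),
        dlvsym(RTLD_NEXT, name, version),
    };
    const char *labels[4] = {
        "dlsym(RTLD_DEFAULT, %s) => %p\n",
        "dlsym(RTLD_NEXT, %s) => %p\n",
        "dlvsym(RTLD_DEFAULT, %s) => %p\n",
        "dlvsym(RTLD_NEXT, %s) => %p\n",
    };

    int found = 0;
    for (int i = 0; i < 4; i++) {
        printf(labels[i], name, results[i]);
        if (results[i] != NULL)
            found++;
    }
    return found;
}
```

Calling `probe_symbol("xdrstdio_create", "GLIBC_2.2.5")` from `main` on each system should give output in the same shape as the sessions above. On glibc older than 2.34 the program may need to be linked with `-ldl`.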
I do not have a RHEL 10 machine to test on, but, I assume, the code would crash
on RHEL 10 with NIS configured, so it is only a matter of time until enterprise
users, who rely on NIS, migrate to RHEL 10 and begin observing the problem.
One possible short-term solution would be to resolve the exact versions of
these functions.
This would work as expected for legacy binaries, however new binaries (e.g.
libnss_nis/libnsl) will end up with their calls being dispatched to unexpected
versions of the symbols under sanitizers.
A better quick solution is probably to remove the interceptors and accept
potential false positives as a result.
Looking at the whole picture, this appears to be a fundamental limitation of
how interceptors currently work.
I see two potential fixes:
1. Intercept `dlopen` and re-scan the symbols for any new real symbols to
dispatch to (tricky).
2. Instead of eagerly resolving all real call targets on startup, postpone the
`dlsym` invocation until the first actual call for each symbol (slower?). Or
maybe, as a middle ground, re-resolve if it is nullptr at the time it is about
to be called.
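The middle-ground variant of option 2 (re-resolve only when the cached target is still null) can be sketched as follows. This is an illustration under stated assumptions, not a patch against libsanitizer: `struct interceptor`, `call_real`, `toy_resolve` and `toy_make_available` are hypothetical names, the resolver is a toy that stands in for a `dlsym`-style lookup, and the sketch ignores thread safety, which a real implementation would have to handle.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*real_fn)(int);
typedef real_fn (*resolver_fn)(const char *name);

struct interceptor {
    const char *name;    /* symbol to dispatch to */
    real_fn real;        /* cached target, NULL while unresolved */
    resolver_fn resolve; /* lookup routine (a dlsym wrapper in real code) */
    unsigned lookups;    /* how many times the resolver was consulted */
};

/* Call the real function, retrying resolution if the cached pointer is
 * still NULL. Returns 0 on success, -1 if the symbol is still missing. */
int call_real(struct interceptor *ic, int arg, int *out) {
    assert(ic != NULL && out != NULL);
    if (ic->real == NULL) {
        ic->real = ic->resolve(ic->name);
        ic->lookups++;
    }
    if (ic->real == NULL)
        return -1;
    *out = ic->real(arg);
    return 0;
}

/* Toy resolver: the symbol appears only after toy_make_available(),
 * mimicking a library that is loaded after startup. */
static int available = 0;
static int twice(int x) { return 2 * x; }

real_fn toy_resolve(const char *name) {
    (void)name;
    return available ? twice : NULL;
}

void toy_make_available(void) { available = 1; }
```

With this shape the extra cost is one null check per call once the target is cached; the resolver runs again only while the symbol is still missing. Whether a lookup issued from the sanitizer runtime can actually see a library that was loaded with local scope is a separate question that the sketch does not address.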