masaori335 opened a new pull request, #13296:
URL: https://github.com/apache/trafficserver/pull/13296
## Problem
On a normal `traffic_server` shutdown (e.g. SIGTERM), `traffic_crashlog` can
spuriously emit an empty crash log against the already-exiting process.
## Background
At startup, `traffic_server` forks `traffic_crashlog --wait`, which parks
itself:
```cpp
EnableDeathSignal(SIGCONT); // prctl(PR_SET_PDEATHSIG, SIGCONT)
kill(getpid(), SIGSTOP);
```
The helper is therefore resumed by `SIGCONT` in **two** indistinguishable
situations:
1. **A real crash** — the crash signal handler `crash_logger_invoke()`
explicitly sends `SIGCONT`.
2. **Any other parent exit** — the kernel's parent-death signal
(`PR_SET_PDEATHSIG`, set to `SIGCONT` in #11614) resumes it.
To tell these apart the helper used `getppid() != parent`. That check is
racy: `PR_SET_PDEATHSIG` fires when the **thread that forked the helper**
terminates, which during a graceful shutdown happens while the rest of
`traffic_server` — and thus its PID — is still alive. So `getppid()` still
equals the original parent, the guard doesn't trip, and the helper proceeds
to write a crash log for a process that merely exited.
## Fix
Replace the `getppid()` guess with an explicit handshake over the socket
that already connects the two processes:
- `crash_logger_invoke()` writes the crashing signal number to the helper
**before** the existing (Linux-only) thread payload. This is the only path
that writes it.
- The helper, after being resumed, does a blocking `read()`:
- **data** ⇒ a real crash ⇒ proceed and write the log;
- **EOF / short read** ⇒ the parent is gone with no notification ⇒ exit
quietly.
- `FD_CLOEXEC` is set on the parent's end of the socket so the write end is
solely owned by `traffic_server`; that guarantees the helper sees EOF on
parent death rather than blocking on an end inherited by a
`popen()`/`system()` child.
`SIGCONT` (rather than `SIGKILL`) is retained deliberately: it lets the
helper survive a spurious/early wake, `read()` for the answer, and stay ready
for a later real crash — whereas `SIGKILL` would prematurely kill the helper
on the same forking-thread race and silently drop crash logging (and was
already moved away from in #11614).
## Testing
- New regression test
`tests/gold_tests/shutdown/crashlog_no_false_positive.test.py`: arms the
helper, lets AuTest SIGTERM `traffic_server`, and asserts the
graceful-shutdown path ran (`received exit signal, shutting down`) while no
crash log was produced (`crashlog started` / `wrote crash log` excluded).
- Existing `crash_test` (real crash → crash log produced) still passes on the
parts this code governs.
## Notes
`PR_SET_PDEATHSIG` is Linux-only, so the false positive itself only occurs on
Linux; on macOS/FreeBSD the helper is simply never woken on a clean exit. The
regression test still confirms the "armed but quiet" behavior there.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]