masaori335 opened a new pull request, #13296:
URL: https://github.com/apache/trafficserver/pull/13296

   ## Problem
   
   On a normal `traffic_server` shutdown (e.g. SIGTERM), `traffic_crashlog` can
   spuriously emit an empty crash log against the already-exiting process.
   
   ## Background
   
   At startup, `traffic_server` forks `traffic_crashlog --wait`, which parks
   itself:
   
   ```cpp
   EnableDeathSignal(SIGCONT);   // prctl(PR_SET_PDEATHSIG, SIGCONT)
   kill(getpid(), SIGSTOP);
   ```
   
   The helper is therefore resumed by `SIGCONT` in **two** indistinguishable
   situations:
   
   1. **A real crash** — the crash signal handler `crash_logger_invoke()` 
explicitly sends `SIGCONT`.
   2. **Any other parent exit** — the kernel's parent-death signal
      (`PR_SET_PDEATHSIG`, set to `SIGCONT` in #11614) resumes it.
   
   To tell these apart the helper used `getppid() != parent`. That check is
   racy: `PR_SET_PDEATHSIG` fires when the **thread that forked the helper**
   terminates, which during a graceful shutdown happens while the rest of
   `traffic_server` — and thus its PID — is still alive. So `getppid()` still
   equals the original parent, the guard doesn't trip, and the helper proceeds
   to write a crash log for a process that merely exited.
   
   ## Fix
   
   Replace the `getppid()` guess with an explicit handshake over the socket
   that already connects the two processes:
   
   - `crash_logger_invoke()` writes the crashing signal number to the helper
     **before** the existing (Linux-only) thread payload. This is the only path
     that writes it.
   - The helper, after being resumed, does a blocking `read()`:
     - **data** ⇒ a real crash ⇒ proceed and write the log;
     - **EOF / short read** ⇒ the parent is gone with no notification ⇒ exit
       quietly.
   - `FD_CLOEXEC` is set on the parent's end of the socket so the write end is
     solely owned by `traffic_server`; that guarantees the helper sees EOF on
     parent death rather than blocking on an end inherited by a
     `popen()`/`system()` child.
   
   `SIGCONT` (rather than `SIGKILL`) is retained deliberately: it lets the
   helper survive a spurious/early wake, `read()` for the answer, and stay ready
   for a later real crash — whereas `SIGKILL` would prematurely kill the helper
   on the same forking-thread race and silently drop crash logging (and was
   already moved away from in #11614).
   
   ## Testing
   
   - New regression test
     `tests/gold_tests/shutdown/crashlog_no_false_positive.test.py`: arms the
     helper, lets AuTest SIGTERM `traffic_server`, and asserts the
     graceful-shutdown path ran (`received exit signal, shutting down`) while no
     crash log was produced (`crashlog started` / `wrote crash log` excluded).
   - Existing `crash_test` (real crash → crash log produced) still passes on the
     parts this code governs.
   
   ## Notes
   
   `PR_SET_PDEATHSIG` is Linux-only, so the false positive itself only occurs on
   Linux; on macOS/FreeBSD the helper is simply never woken on a clean exit. The
   regression test still confirms the "armed but quiet" behavior there.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to