turboFei opened a new pull request, #56262:
URL: https://github.com/apache/spark/pull/56262

   ### What changes were proposed in this pull request?
   
   When running Spark on Kubernetes, driver pod failures are difficult to 
diagnose because error details are only available through `kubectl logs`, and 
those logs may already have been rotated by the time an operator investigates.
   
   Kubernetes provides a standard mechanism for surfacing container exit 
reasons: writing to `/dev/termination-log` (max 4096 bytes). This message is 
preserved in the pod status and visible under `Last State > Message` in 
`kubectl describe pod`, even after the container logs are gone, for as long as 
the pod object exists.
   
   This patch adds termination log support to the Spark driver entrypoint:
   
   - On non-zero exit, the last 4KB of driver stderr is written to 
`/dev/termination-log`
   - stderr is streamed through an **awk ring buffer via named pipe**: fully 
visible in `kubectl logs` in real-time, with no unbounded `/tmp` disk growth 
regardless of how long the driver runs
   - **SIGTERM is forwarded to tini** so Kubernetes graceful shutdown still 
reaches the Spark driver (without `exec`, bash remains PID 1, and a PID 1 
process ignores SIGTERM unless it installs a handler)
   - Only the `driver` case is affected; `executor` and pass-through modes are 
unchanged
   - The `TERMINATION_LOG` env var can override the path (useful for local 
testing)
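
The SIGTERM bullet above can be illustrated with a minimal bash sketch. It assumes the entrypoint starts the driver command as a background child instead of using `exec`; the function name `run_forwarding_term` and the variable `_term_seen` are hypothetical and are not taken from the patch.

```shell
#!/usr/bin/env bash

# run_forwarding_term CMD [ARGS...]  (hypothetical helper name)
# Runs CMD as a background child, relays SIGTERM to it, and returns the
# child's exit status.
run_forwarding_term() {
  _term_seen=0
  "$@" &
  local child=$!

  # The child PID is expanded now, so the handler does not depend on scope.
  trap "_term_seen=1; kill -TERM $child 2>/dev/null" TERM

  local rc=0
  wait "$child" || rc=$?

  # A trapped signal makes `wait` return early with a status above 128,
  # before the child has exited. Wait once more to collect the child's
  # real exit status. (Repeated signals are not handled in this sketch.)
  if [ "$_term_seen" -eq 1 ]; then
    rc=0
    wait "$child" || rc=$?
  fi

  trap - TERM
  return "$rc"
}
```

With this shape, a child that handles SIGTERM and shuts down cleanly can still report exit code 0, while a child without a handler is terminated by the signal and typically reports 143.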
   
   **How it works:**
   ```
   driver stderr ──→ named pipe ──→ awk ring buffer (memory only, max 4KB)
                                         └──→ container stderr (kubectl logs, unchanged)
   on exit code != 0:
     cat last4k_buffer → /dev/termination-log
   ```
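
The flow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the helper names `keep_tail` and `run_with_termination_log` are hypothetical, the tail is held in an awk string rather than a true ring buffer, and it is flushed to a small temporary file (at most 4096 bytes) when the driver's stderr closes. `LC_ALL=C` is set so that awk's `length` and `substr` count bytes rather than characters.

```shell
#!/usr/bin/env bash

# keep_tail MAX_BYTES OUT_FILE  (hypothetical helper name)
# Copies stdin to stderr line by line and keeps only the last MAX_BYTES
# bytes in memory; the kept tail is written to OUT_FILE at end of input.
keep_tail() {
  LC_ALL=C awk -v max="$1" -v out="$2" '
    {
      print > "/dev/stderr"            # pass-through for kubectl logs
      fflush("/dev/stderr")
      buf = buf $0 "\n"
      n = length(buf)
      if (n > max) buf = substr(buf, n - max + 1)   # drop the oldest bytes
    }
    END { printf "%s", buf > out }
  '
}

# run_with_termination_log CMD [ARGS...]  (hypothetical helper name)
# Runs CMD with its stderr routed through a named pipe into keep_tail.
# On a non-zero exit, the captured tail is copied to the termination log.
run_with_termination_log() {
  local term_log="${TERMINATION_LOG:-/dev/termination-log}"
  local dir fifo tail_file awk_pid rc=0

  dir="$(mktemp -d)"
  fifo="$dir/stderr.fifo"
  tail_file="$dir/tail"
  mkfifo "$fifo"

  keep_tail 4096 "$tail_file" < "$fifo" &
  awk_pid=$!

  "$@" 2> "$fifo" || rc=$?
  wait "$awk_pid"                      # awk exits once the pipe reaches EOF

  if [ "$rc" -ne 0 ]; then
    cat "$tail_file" > "$term_log"
  fi
  rm -rf "$dir"
  return "$rc"
}
```

For a local run outside Kubernetes, pointing `TERMINATION_LOG` at an ordinary file (for example one created with `mktemp`) avoids the need for `/dev/termination-log` to exist.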
   
   ### Why are the changes needed?
   
   Without this change, operators investigating a failed Spark driver pod on 
Kubernetes must:
   1. Have captured `kubectl logs` before the pod was cleaned up, or
   2. Have external log aggregation configured
   
   With this change, `kubectl describe pod <driver-pod>` shows the last 4KB of 
driver stderr in the `Last State > Message` field whenever the driver exits 
with a non-zero code, providing immediate diagnostic context.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. For a failed Spark driver pod, `kubectl describe pod` now shows the 
last 4KB of driver stderr in the `Last State > Message` field.
   
   ### How was this patch tested?
   
   - Verified locally that the awk ring buffer correctly captures the last 4096 
bytes of stderr output (tested with >56KB of simulated log output)
   - Verified that `kubectl logs` output is unchanged (stderr still streams in 
real-time)
   - Verified that SIGTERM forwarding works correctly for graceful shutdown
   - Verified that named pipe is cleaned up on both normal and abnormal exit 
(via `trap ... EXIT`)
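
The named-pipe cleanup check in the last bullet can be reproduced locally with a small sketch like the one below. The helper name `with_temp_fifo` is hypothetical; it only demonstrates that an `EXIT` trap removes the pipe whether the wrapped command succeeds or fails, and it is not the entrypoint code itself.

```shell
#!/usr/bin/env bash

# with_temp_fifo CMD [ARGS...]  (hypothetical helper name)
# Creates a named pipe in a private temp directory, prints its path, runs
# CMD, and relies on an EXIT trap to remove the pipe afterwards.
with_temp_fifo() {
  (
    # The subshell gives the EXIT trap a well-defined scope: it fires when
    # this subshell ends, regardless of CMD's exit status.
    dir="$(mktemp -d)"
    fifo="$dir/stderr.fifo"
    trap 'rm -rf "$dir"' EXIT
    mkfifo "$fifo"
    echo "$fifo"
    "$@"
  )
}
```

Calling it once with `true` and once with `false`, then checking that the printed path no longer exists, covers both the normal and the non-zero exit case.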
   
   ### Was this patch authored or co-authored by a generative AI?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

