kaxil opened a new pull request, #68644: URL: https://github.com/apache/airflow/pull/68644
When an `SSHRemoteJobOperator` task is cancelled, `on_kill` ran `kill $(cat pid)` against the recorded PID. That PID is the detached **wrapper subshell**, not the user command beneath it, and `kill <pid>` signals only that one process. So the wrapper died but the real remote command kept running as an orphan, and because the wrapper is what writes the `exit_code` file, that file never appeared, leaving the job stuck until the trigger's `timeout` elapsed. This makes cancellation tear down the whole job tree: - The wrapper launches under `setsid` (when available), so it is its own session/process-group leader. The recorded PID (`$!`) is then also the process-group id. - `on_kill` signals the group (`kill -TERM -<pgid>`), so the user command and anything it spawned die together. - Without `setsid`, it falls back to the previous single-PID kill. ## Design rationale - **`setsid`, not `set -m`:** job-control mode is unreliable without a TTY, and submission uses `get_pty=False`. `setsid` creates a new session/group without a TTY. Verified that `setsid bash -c '...' &` does not fork an extra level, so `$!` is the leader PID and the PGID, and the synchronous PID write is unchanged from before. - **PID `> 1` guard:** a corrupt or partial PID of `0`/`1` would turn `kill -<pid>` into a broadcast to every process the SSH account can signal. `[ "$p" -gt 1 ]` blocks `0`, `1`, empty, and non-numeric values. - **Windows:** `Stop-Process -Id` kills only the leaf, switched to `taskkill /PID ... /T /F` (process plus child tree). Also renamed `$pid` to `$procId`, since `$PID` is a read-only automatic variable in PowerShell and the old assignment would have thrown. ## Scope and gotchas - The operator already requires a bash-compatible login shell (the wrapper uses `set -o pipefail`/`echo -n`); this PR does not change that. Only the kill string needs to be POSIX-portable, and it is dash/ash-clean. - Hosts without `setsid` (some macOS/BSD images) keep the previous single-PID behavior, so children there are still orphaned. Linux remotes (the common case) are fully covered. ## Known limitations - The Windows `taskkill /T` path has no executable test (no Windows CI here), though it is the standard tree-kill. Related: apache/airflow#63355 (the core defer/heartbeat race that can spuriously trigger this `on_kill` path). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
