kaxil opened a new pull request, #68644:
URL: https://github.com/apache/airflow/pull/68644

   When an `SSHRemoteJobOperator` task is cancelled, `on_kill` ran `kill $(cat 
pid)` against the recorded PID. That PID is the detached **wrapper subshell**, 
not the user command beneath it, and `kill <pid>` signals only that one 
process. So the wrapper died but the real remote command kept running as an 
orphan, and because the wrapper is what writes the `exit_code` file, that file 
never appeared, leaving the job stuck until the trigger's `timeout` elapsed.
   
   This makes cancellation tear down the whole job tree:
   
   - The wrapper launches under `setsid` (when available), so it is its own 
session/process-group leader. The recorded PID (`$!`) is then also the 
process-group id.
   - `on_kill` signals the group (`kill -TERM -<pgid>`), so the user command 
and anything it spawned die together.
   - Without `setsid`, it falls back to the previous single-PID kill.
   
   ## Design rationale
   
   - **`setsid`, not `set -m`:** job-control mode is unreliable without a TTY, 
and submission uses `get_pty=False`. `setsid` creates a new session/group 
without a TTY. Verified that `setsid bash -c '...' &` does not fork an extra 
level, so `$!` is the leader PID and the PGID, and the synchronous PID write is 
unchanged from before.
   - **PID `> 1` guard:** a corrupt or partial PID of `0`/`1` would turn `kill 
-<pid>` into a broadcast to every process the SSH account can signal. `[ "$p" 
-gt 1 ]` blocks `0`, `1`, empty, and non-numeric values.
   - **Windows:** `Stop-Process -Id` kills only the leaf, switched to `taskkill 
/PID ... /T /F` (process plus child tree). Also renamed `$pid` to `$procId`, 
since `$PID` is a read-only automatic variable in PowerShell and the old 
assignment would have thrown.
   
   ## Scope and gotchas
   
   - The operator already requires a bash-compatible login shell (the wrapper 
uses `set -o pipefail`/`echo -n`); this PR does not change that. Only the kill 
string needs to be POSIX-portable, and it is dash/ash-clean.
   - Hosts without `setsid` (some macOS/BSD images) keep the previous 
single-PID behavior, so children there are still orphaned. Linux remotes (the 
common case) are fully covered.
   
   ## Known limitations
   
   - The Windows `taskkill /T` path has no executable test (no Windows CI 
here), though it is the standard tree-kill.
   
   Related: apache/airflow#63355 (the core defer/heartbeat race that can 
spuriously trigger this `on_kill` path).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to