[PR] Reduce SSH connection churn in `SSHRemoteJobOperator` under high fan-out [airflow]

via GitHub Fri, 05 Jun 2026 18:19:51 -0700


kaxil opened a new pull request, #68115:
URL: https://github.com/apache/airflow/pull/68115


   `SSHRemoteJobOperator` currently opens a brand-new SSH connection for every 
remote command. A large `.expand()` fan-out against a single host drives the 
connection rate past the remote `sshd` `MaxStartups` limit, which drops 
connections. This showed up in load testing two ways: submit-time failures 
(`paramiko ... Error reading SSH protocol banner`) and job directories left 
behind on the remote host.
   
   This PR cuts the connection rate at the source and hardens the retry and 
cleanup paths. Defaults preserve existing behavior.
   
   ## Root cause
   
   The banner error appears within a few milliseconds of the connect attempt 
and the underlying exception is ` **EOFError**: the server closed the socket 
before sending its banner`. That is `sshd` `MaxStartups` (default `10:30:100`) 
throttling concurrent unauthenticated connections, not a slow banner, so 
raising `banner_timeout` does not help.
   
   The trigger reconnected 2-3 times per poll (completion check, log size, log 
read), per task, for the whole job, so a 200-way fan-out sustained a very high 
handshake rate against one server. The retries failed for the same reason: 
synchronized across the fleet while the rest of the fleet kept the server 
saturated. Cleanup ran only on completion and silently swallowed its own 
dropped connection, orphaning the directory.
   
   ## What changed
   
   - Trigger holds a single connection for the whole poll loop instead of 
reconnecting per command, and reconnects with jittered backoff (bounded by 
`max_reconnect_attempts`) if it drops. `asyncssh.Error` is now treated as a 
reconnectable failure. The reconnect budget resets only after a full successful 
poll, so a connection that handshakes but whose command channel keeps failing 
(for example `ChannelOpenError` under `MaxSessions`) still exhausts the budget 
instead of deferring forever.
   - Operator reuses one connection for OS detection and submission (was two).
   - Cleanup retries (`cleanup_retries`) instead of orphaning the directory on 
a single dropped connection.
   - New `conn_retry_attempts` on the hook and operator so the initial submit 
burst tolerates transient refusals.
   - `SSHHookAsync` sets a keepalive on the now long-lived trigger connection.
   
   ## Measured against a real OpenSSH container
   
   | | before | after |
   |---|---|---|
   | SSH connections for one ~9s job | 20 | 5 |
   | `MaxStartups` drops, 120-way simultaneous submit | 71 | 36 |
   
   ## New parameters (all optional)
   
   - `conn_retry_attempts`: operator default 5, hook default 3 (unchanged 
behavior at 3).
   - `cleanup_retries`: default 3.
   - `command_timeout`: default 30.0, and `max_reconnect_attempts`: default 5, 
both forwarded from the operator to the trigger.
   
   ## Gotchas
   
   - The direct lever for very high fan-out is raising `MaxStartups` (and 
`MaxSessions`) on the SSH server. The operator docs now call this out.
   - Cleanup only runs when a job reaches completion, so killed or timed-out 
tasks can still leave a directory behind. A server-side TTL reaper (for example 
`systemd-tmpfiles`) is recommended for those, and the docs mention it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Reduce SSH connection churn in `SSHRemoteJobOperator` under high fan-out [airflow]

Reply via email to