kaxil opened a new pull request, #68115: URL: https://github.com/apache/airflow/pull/68115
`SSHRemoteJobOperator` currently opens a brand-new SSH connection for every remote command. A large `.expand()` fan-out against a single host drives the connection rate past the remote `sshd` `MaxStartups` limit, which drops connections. This showed up in load testing two ways: submit-time failures (`paramiko ... Error reading SSH protocol banner`) and job directories left behind on the remote host.

This PR cuts the connection rate at the source and hardens the retry and cleanup paths. Defaults preserve existing behavior.

## Root cause

The banner error appears within a few milliseconds of the connect attempt, and the underlying exception is `EOFError`: the server closed the socket before sending its banner. That is `sshd` `MaxStartups` (default `10:30:100`) throttling concurrent unauthenticated connections, not a slow banner, so raising `banner_timeout` does not help.

The trigger reconnected 2-3 times per poll (completion check, log size, log read), per task, for the whole job, so a 200-way fan-out sustained a very high handshake rate against one server. The retries failed for the same reason: they were synchronized across the fleet, and the rest of the fleet kept the server saturated. Cleanup ran only on completion and silently swallowed its own dropped connection, orphaning the directory.

## What changed

- Trigger holds a single connection for the whole poll loop instead of reconnecting per command, and reconnects with jittered backoff (bounded by `max_reconnect_attempts`) if it drops. `asyncssh.Error` is now treated as a reconnectable failure. The reconnect budget resets only after a full successful poll, so a connection that handshakes but whose command channel keeps failing (for example `ChannelOpenError` under `MaxSessions`) still exhausts the budget instead of deferring forever.
- Operator reuses one connection for OS detection and submission (was two).
- Cleanup retries (`cleanup_retries`) instead of orphaning the directory on a single dropped connection.
- New `conn_retry_attempts` on the hook and operator so the initial submit burst tolerates transient refusals.
- `SSHHookAsync` sets a keepalive on the now long-lived trigger connection.

## Measured against a real OpenSSH container

| | before | after |
|---|---|---|
| SSH connections for one ~9s job | 20 | 5 |
| `MaxStartups` drops, 120-way simultaneous submit | 71 | 36 |

## New parameters (all optional)

- `conn_retry_attempts`: operator default 5, hook default 3 (unchanged behavior at 3).
- `cleanup_retries`: default 3.
- `command_timeout`: default 30.0, and `max_reconnect_attempts`: default 5, both forwarded from the operator to the trigger.

## Gotchas

- The direct lever for very high fan-out is raising `MaxStartups` (and `MaxSessions`) on the SSH server. The operator docs now call this out.
- Cleanup only runs when a job reaches completion, so killed or timed-out tasks can still leave a directory behind. A server-side TTL reaper (for example `systemd-tmpfiles`) is recommended for those, and the docs mention it.
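
For reviewers who want the reconnect-budget rule from "What changed" in concrete form, here is a minimal sketch. It is not the code in this PR: `poll_until_done`, `backoff_delay`, `ReconnectableError`, and the `connect` / `poll_once` callables are hypothetical names made up for illustration, and the full-jitter exponential backoff is one common choice, assumed here rather than taken from the implementation. The point it shows is that the failure counter resets only after a complete poll succeeds, not after a successful handshake.

```python
import asyncio
import random


class ReconnectableError(Exception):
    """Hypothetical stand-in for a dropped connection or a failed channel open."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def poll_until_done(connect, poll_once, max_reconnect_attempts=5,
                          poll_interval=0.0, base=1.0):
    """Poll over one long-lived connection, reconnecting within a bounded budget.

    connect:   async callable returning a connection object (assumption)
    poll_once: async callable taking the connection, returning True when the job is done
    """
    conn = None
    failures = 0
    while True:
        try:
            if conn is None:
                # A successful handshake alone does NOT reset the budget.
                conn = await connect()
            done = await poll_once(conn)
            # Reset only after a full successful poll.
            failures = 0
            if done:
                return True
            await asyncio.sleep(poll_interval)
        except ReconnectableError:
            conn = None  # force a fresh connection on the next iteration
            failures += 1
            if failures > max_reconnect_attempts:
                raise
            # Jitter spreads retries out so a fleet of triggers does not retry in lockstep.
            await asyncio.sleep(backoff_delay(failures, base=base))
```

With this shape, a connection that handshakes but whose command channel fails on every poll still raises after `max_reconnect_attempts` consecutive failures, because the counter is never reset in between.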
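
As background for the `MaxStartups` numbers above, the following sketch models the documented `start:rate:full` semantics from `sshd_config(5)`: below `start` unauthenticated connections nothing is dropped, at `start` the drop probability is `rate`/100, and it rises linearly to 100% at `full`. The function names are hypothetical, and the floating-point interpolation is an approximation for intuition, not a copy of OpenSSH's arithmetic.

```python
def parse_max_startups(spec: str) -> tuple[int, int, int]:
    """Split a 'start:rate:full' MaxStartups value into three integers."""
    start, rate, full = (int(part) for part in spec.split(":"))
    return start, rate, full


def drop_probability(unauthenticated: int, spec: str = "10:30:100") -> float:
    """Approximate probability that sshd refuses a new connection attempt.

    unauthenticated: number of connections currently in the pre-auth phase.
    """
    start, rate, full = parse_max_startups(spec)
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    base = rate / 100.0
    # Linear ramp from rate/100 at `start` up to 1.0 at `full`.
    return base + (1.0 - base) * (unauthenticated - start) / (full - start)
```

Under the default `10:30:100`, this model starts refusing roughly a third of new attempts once ten handshakes are in flight at the same time, which is consistent with a large simultaneous submit seeing drops even though each individual handshake is fast.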
