Thanks Kaxil, Amogh. This is really helpful feedback. My earlier POC in #60618 was along these lines; a dedicated TestConnection workload type with its own scheduler dispatch, worker side timeout enforcement, and a DB backed state machine (PENDING → RUNNING → SUCCESS/FAILED). I moved to the ExecutorCallback-based approach in #62343 to reduce the scope, but it sounds like the dedicated workload was heading towards the right direction. That said, #60618 didn't address several of the concerns raised here and it had no queue routing (hardcoded first executor), no concurrency budget, and no reaper for stuck requests. So this isn't just a revert; it's building on that foundation with the operational controls you're describing. And based on the discussion, here is the scope I plan to implement in the updated PR:
1. Dedicated ConnectionTest workload: not transported as ExecutorCallback 2. Save & Test flow: connection must be saved; worker fetches via connection_id through secrets 3. Explicit routing: optional queue parameter on the API, configuration driven default. Team based routing for multi team setups as a natural follow up once that infrastructure lands (per Jarek's point that team is a connection attribute and routing should happen automatically) 4. Dispatch budget: configurable concurrency cap so connection tests can't starve task execution 5. Strict timeout + stale-work reaper: worker side timeout enforcement, scheduler side reaper for stuck RUNNING requests 6. Fail fast on unsupported executors: upfront validation rather than runtime failure If there are no objections or anything I've missed, I'll update #62343 accordingly. Anish On Thu, Feb 26, 2026 at 1:52 PM Kaxil Naik <[email protected]> wrote: > > +1 on moving connection testing off the API server and onto workers for the > security/isolation reasons already discussed. > > I also think the queue concern is critical. > > Queues are a core routing abstraction in Airflow for directing work to > specific worker populations (runtime/network/location). For some of our > enterprise customer's deployments, some workers can reach certain > third-party systems while others cannot, so queue placement directly > affects the correctness of a connection test result. > > My main concern is architectural: connection testing is currently being > transported as an ExecutorCallback. Conceptually, connection testing is not > a callback; it is an ad-hoc workload. That mismatch creates two concrete > risks in the current model: > > 1) Routing correctness: > Default/fallback queue behavior can run tests in the wrong worker > environment > > 2) Fairness and isolation: > Callbacks are correctness-critical for state progression. Connection > tests are user-initiated and can hang; they should not compete in the same > priority lane. > > I agree callbacks are critical for correctness (they drive state/failure > progression), so we should preserve their ability to run promptly. At the > same time, connection testing is different in nature (user-initiated ad-hoc > workload, potentially long-running/hanging) and should not compete in the > same unbounded priority lane. A hanging connection test can occupy a worker > slot for a long time, so this needs strict timeout/reaper behavior in > addition to dispatch controls. > > So my suggestion is to require a dedicated ConnectionTest workload before > merge, with: > - explicit routing semantics (queue/executor/team) -- and the defaults can > be config driven, > - dedicated dispatch budget/concurrency control, > - strict timeout + stale-work reaper, > - fail-fast behavior on unsupported executors. > > This keeps the security goal while avoiding routing regressions and > scheduler starvation risk, and keeps callback lanes reserved for callback > semantics. > > Regards, > Kaxil > > On Thu, 26 Feb 2026 at 07:59, Amogh Desai <[email protected]> wrote: > > > Good discussion, Anish. +1 on moving connection testing to workers -- > > running user supplied > > driver code on API server is the right thing to move away from. > > > > Strongly agree with Ephraim and Jarek here. If we require the connection to > > be saved before testing > > (so the worker can fetch it via connection_id through the standard secrets > > path), the UI button should > > make that explicit. "Save & Test" is exactly that and it will avoid any > > surprise of overwriting an existing > > connection while *just testing :)* > > > > Re queue routing: I think an optional queue parameter on the test > > connection API should be part of the > > initial implementation itself rather than follow up. The default can be > > *default*, which covers the common case. > > But there are some other scenarios too: > > > > * Workers on different queues deployed in different network zones meaning > > that the test would pass or fail > > depending on which queue makes it run. > > > > * Setups where different queues have different secrets backend, the worker > > might not even be able to resolve the > > connection at all. > > > > The cost of having an optional queue parameter is kinda low and I think it > > avoids shipping something that > > gives wrong answers in these setups, so if the cost of it is relatively > > low, can we consider doing it? > > > > > > > > Thanks & Regards, > > Amogh Desai > > > > > > On Tue, Feb 24, 2026 at 10:29 PM Jarek Potiuk <[email protected]> wrote: > > > > > > Maybe changing the button to 'save & test' would suffix. > > > > > > +10 > > > > > > On Tue, Feb 24, 2026 at 1:17 PM Ephraim Anierobi < > > > [email protected]> > > > wrote: > > > > > > > Cool idea. > > > > > > > > On saving the connection automatically, I think we should make it > > > explicit > > > > in the UI that testing the connection will save the connection. This > > will > > > > help the users to know that they are not just testing but also creating > > > the > > > > connection with the entered credentials. Without this being explicit, I > > > > think users may unknowingly replace a connection while just trying if a > > > new > > > > connection would work. > > > > > > > > Maybe changing the button to 'save & test' would suffix. > > > > > > > > On 2026/02/22 03:50:11 Anish Giri wrote: > > > > > Hi all, > > > > > > > > > > I'd like to discuss moving connection testing off the API server and > > > > > onto workers. Jarek suggested this direction in a comment on #59643 > > > > > [1], and I think the Callback infrastructure being built for running > > > > > callbacks on executors is the right foundation for it. > > > > > > > > > > Since 2.7.0, test_connection has been disabled by default (#32052). > > > > > Running it on the API server has two problems: the API server > > > > > shouldn't be executing user-supplied driver code (Jarek described the > > > > > ODBC/JDBC risks in detail on #59643), and workers typically have > > > > > network access to external systems that API servers don't, so test > > > > > results from the API server can be misleading. > > > > > > > > > > Ramit's generic Callback model (#54796 [2]) and Ferruzzi's > > > > > in-progress executor dispatch (#61153 [3]) together give us most of > > > > > what's needed. The flow would be: > > > > > > > > > > 1. UI calls POST /connections/test > > > > > 2. API server Fernet-encrypts the connection URI, creates an > > > > > ExecutorCallback pointing to the test function, returns an ID > > > > > 3. Scheduler dispatch loop (from #61153) picks it up, sends it > > > > > to the executor > > > > > 4. Worker decrypts the URI, builds a transient Connection, calls > > > > > test_connection(), reports result through the callback path > > > > > 5. UI polls GET /connections/test/{id} until it gets a terminal > > > > > state > > > > > > > > > > The connection-testing-specific code would be small: a POST endpoint > > > > > to queue the test, a GET endpoint to poll for results, and the worker > > > > > function that decrypts and runs test_connection(). > > > > > > > > > > One thing I noticed: #61153's _enqueue_executor_callbacks currently > > > > > requires dag_run_id in the callback data dict, and > > ExecuteCallback.make > > > > > needs a DagRun for bundle info. Connection tests don't have a DagRun. > > > > > It would be a small change to make that optional. The dispatch query > > > > > itself is already generic (selects all PENDING ExecutorCallbacks). I > > > > > can take a look at decoupling that if it would be useful. > > > > > > > > > > A couple of other open questions: > > > > > > > > > > 1. The connection test needs to store an encrypted URI, conn_type, > > and > > > > > some timestamps. Is the Callback.data JSON column the right place > > > > > for that, or does it warrant its own small table? > > > > > > > > > > 2. Stale requests: if a worker crashes mid-test, the record stays > > > > > in a non-terminal state. Should there be a scheduler-side reaper > > > > > similar to zombie task detection, or is client-side timeout (60s > > > > > in the UI) enough? > > > > > > > > > > I explored this earlier in #60618 [4] with a self-contained > > > > > implementation. Now that the ExecutorCallback dispatch is taking > > shape > > > > > in #61153, building on top of will be in right direction. > > > > > > > > > > Thoughts? > > > > > > > > > > Anish > > > > > > > > > > [1] https://github.com/apache/airflow/pull/59643 > > > > > [2] https://github.com/apache/airflow/pull/54796 > > > > > [3] https://github.com/apache/airflow/pull/61153 > > > > > [4] https://github.com/apache/airflow/pull/60618 > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: [email protected] > > > > > For additional commands, e-mail: [email protected] > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: [email protected] > > > > For additional commands, e-mail: [email protected] > > > > > > > > > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
