Re: [DISCUSS] Move connection testing to workers

Jarek Potiuk Thu, 26 Feb 2026 22:48:37 -0800

Yep all good points there

On Fri, Feb 27, 2026, 01:32 Kaxil Naik <[email protected]> wrote:


> Sounds good, thanks. Feel free to handle them in separate PRs if needed.
>
>
> On Thu, 26 Feb 2026 at 22:27, Anish Giri <[email protected]> wrote:
>
> > Thanks Kaxil, Amogh. This is really helpful feedback.
> > My earlier POC in #60618 was along these lines; a dedicated
> > TestConnection workload type with its own scheduler dispatch, worker
> > side timeout enforcement, and a DB backed state machine (PENDING →
> > RUNNING → SUCCESS/FAILED). I moved to the ExecutorCallback-based
> > approach in #62343 to reduce the scope, but it sounds like the
> > dedicated workload was heading towards the right direction. That said,
> > #60618 didn't address several of the concerns raised here and it had
> > no queue routing (hardcoded first executor), no concurrency budget,
> > and no reaper for stuck requests. So this isn't just a revert; it's
> > building on that foundation with the operational controls you're
> > describing. And based on the discussion, here is the scope I plan to
> > implement in the updated PR:
> >
> > 1. Dedicated ConnectionTest workload: not transported as ExecutorCallback
> > 2. Save & Test flow: connection must be saved; worker fetches via
> > connection_id through secrets
> > 3. Explicit routing: optional queue parameter on the API,
> > configuration driven default. Team based routing for multi team setups
> > as a natural follow up once that infrastructure lands (per Jarek's
> > point that team is a connection attribute and routing should happen
> > automatically)
> > 4. Dispatch budget: configurable concurrency cap so connection tests
> > can't starve task execution
> > 5. Strict timeout + stale-work reaper: worker side timeout
> > enforcement, scheduler side reaper for stuck RUNNING requests
> > 6. Fail fast on unsupported executors: upfront validation rather than
> > runtime failure
> >
> > If there are no objections or anything I've missed, I'll update #62343
> > accordingly.
> >
> > Anish
> >
> > On Thu, Feb 26, 2026 at 1:52 PM Kaxil Naik <[email protected]> wrote:
> > >
> > > +1 on moving connection testing off the API server and onto workers for
> > the
> > > security/isolation reasons already discussed.
> > >
> > > I also think the queue concern is critical.
> > >
> > > Queues are a core routing abstraction in Airflow for directing work to
> > > specific worker populations (runtime/network/location). For some of our
> > > enterprise customer's deployments, some workers can reach certain
> > > third-party systems while others cannot, so queue placement directly
> > > affects the correctness of a connection test result.
> > >
> > > My main concern is architectural: connection testing is currently being
> > > transported as an ExecutorCallback. Conceptually, connection testing is
> > not
> > > a callback; it is an ad-hoc workload. That mismatch creates two
> concrete
> > > risks in the current model:
> > >
> > > 1) Routing correctness:
> > >    Default/fallback queue behavior can run tests in the wrong worker
> > > environment
> > >
> > > 2) Fairness and isolation:
> > >    Callbacks are correctness-critical for state progression. Connection
> > > tests are user-initiated and can hang; they should not compete in the
> > same
> > > priority lane.
> > >
> > > I agree callbacks are critical for correctness (they drive
> state/failure
> > > progression), so we should preserve their ability to run promptly. At
> the
> > > same time, connection testing is different in nature (user-initiated
> > ad-hoc
> > > workload, potentially long-running/hanging) and should not compete in
> the
> > > same unbounded priority lane. A hanging connection test can occupy a
> > worker
> > > slot for a long time, so this needs strict timeout/reaper behavior in
> > > addition to dispatch controls.
> > >
> > > So my suggestion is to require a dedicated ConnectionTest workload
> before
> > > merge, with:
> > > - explicit routing semantics (queue/executor/team) -- and the defaults
> > can
> > > be config driven,
> > > - dedicated dispatch budget/concurrency control,
> > > - strict timeout + stale-work reaper,
> > > - fail-fast behavior on unsupported executors.
> > >
> > > This keeps the security goal while avoiding routing regressions and
> > > scheduler starvation risk, and keeps callback lanes reserved for
> callback
> > > semantics.
> > >
> > > Regards,
> > > Kaxil
> > >
> > > On Thu, 26 Feb 2026 at 07:59, Amogh Desai <[email protected]>
> wrote:
> > >
> > > > Good discussion, Anish. +1 on moving connection testing to workers --
> > > > running user supplied
> > > > driver code on API server is the right thing to move away from.
> > > >
> > > > Strongly agree with Ephraim and Jarek here. If we require the
> > connection to
> > > > be saved before testing
> > > > (so the worker can fetch it via connection_id through the standard
> > secrets
> > > > path), the UI button should
> > > > make that explicit. "Save & Test" is exactly that and it will avoid
> any
> > > > surprise of overwriting an existing
> > > > connection while *just testing :)*
> > > >
> > > > Re queue routing: I think an optional queue parameter on the test
> > > > connection API should be part of the
> > > > initial implementation itself rather than follow up. The default can
> be
> > > > *default*, which covers the common case.
> > > > But there are some other scenarios too:
> > > >
> > > > * Workers on different queues deployed in different network zones
> > meaning
> > > > that the test would pass or fail
> > > > depending on which queue makes it run.
> > > >
> > > > * Setups where different queues have different secrets backend, the
> > worker
> > > > might not even be able to resolve the
> > > > connection at all.
> > > >
> > > > The cost of having an optional queue parameter is kinda low and I
> > think it
> > > > avoids shipping something that
> > > > gives wrong answers in these setups, so if the cost of it is
> relatively
> > > > low, can we consider doing it?
> > > >
> > > >
> > > >
> > > > Thanks & Regards,
> > > > Amogh Desai
> > > >
> > > >
> > > > On Tue, Feb 24, 2026 at 10:29 PM Jarek Potiuk <[email protected]>
> > wrote:
> > > >
> > > > > > Maybe changing the button to 'save & test' would suffix.
> > > > >
> > > > > +10
> > > > >
> > > > > On Tue, Feb 24, 2026 at 1:17 PM Ephraim Anierobi <
> > > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Cool idea.
> > > > > >
> > > > > > On saving the connection automatically, I think we should make it
> > > > > explicit
> > > > > > in the UI that testing the connection will save the connection.
> > This
> > > > will
> > > > > > help the users to know that they are not just testing but also
> > creating
> > > > > the
> > > > > > connection with the entered credentials. Without this being
> > explicit, I
> > > > > > think users may unknowingly replace a connection while just
> trying
> > if a
> > > > > new
> > > > > > connection would work.
> > > > > >
> > > > > > Maybe changing the button to 'save & test' would suffix.
> > > > > >
> > > > > > On 2026/02/22 03:50:11 Anish Giri wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I'd like to discuss moving connection testing off the API
> server
> > and
> > > > > > > onto workers. Jarek suggested this direction in a comment on
> > #59643
> > > > > > > [1], and I think the Callback infrastructure being built for
> > running
> > > > > > > callbacks on executors is the right foundation for it.
> > > > > > >
> > > > > > > Since 2.7.0, test_connection has been disabled by default
> > (#32052).
> > > > > > > Running it on the API server has two problems: the API server
> > > > > > > shouldn't be executing user-supplied driver code (Jarek
> > described the
> > > > > > > ODBC/JDBC risks in detail on #59643), and workers typically
> have
> > > > > > > network access to external systems that API servers don't, so
> > test
> > > > > > > results from the API server can be misleading.
> > > > > > >
> > > > > > > Ramit's generic Callback model (#54796 [2]) and Ferruzzi's
> > > > > > > in-progress executor dispatch (#61153 [3]) together give us
> most
> > of
> > > > > > > what's needed. The flow would be:
> > > > > > >
> > > > > > > 1. UI calls POST /connections/test
> > > > > > > 2. API server Fernet-encrypts the connection URI, creates an
> > > > > > > ExecutorCallback pointing to the test function, returns an ID
> > > > > > > 3. Scheduler dispatch loop (from #61153) picks it up, sends it
> > > > > > > to the executor
> > > > > > > 4. Worker decrypts the URI, builds a transient Connection,
> calls
> > > > > > > test_connection(), reports result through the callback path
> > > > > > > 5. UI polls GET /connections/test/{id} until it gets a terminal
> > > > > > > state
> > > > > > >
> > > > > > > The connection-testing-specific code would be small: a POST
> > endpoint
> > > > > > > to queue the test, a GET endpoint to poll for results, and the
> > worker
> > > > > > > function that decrypts and runs test_connection().
> > > > > > >
> > > > > > > One thing I noticed: #61153's _enqueue_executor_callbacks
> > currently
> > > > > > > requires dag_run_id in the callback data dict, and
> > > > ExecuteCallback.make
> > > > > > > needs a DagRun for bundle info. Connection tests don't have a
> > DagRun.
> > > > > > > It would be a small change to make that optional. The dispatch
> > query
> > > > > > > itself is already generic (selects all PENDING
> > ExecutorCallbacks). I
> > > > > > > can take a look at decoupling that if it would be useful.
> > > > > > >
> > > > > > > A couple of other open questions:
> > > > > > >
> > > > > > > 1. The connection test needs to store an encrypted URI,
> > conn_type,
> > > > and
> > > > > > > some timestamps. Is the Callback.data JSON column the right
> place
> > > > > > > for that, or does it warrant its own small table?
> > > > > > >
> > > > > > > 2. Stale requests: if a worker crashes mid-test, the record
> stays
> > > > > > > in a non-terminal state. Should there be a scheduler-side
> reaper
> > > > > > > similar to zombie task detection, or is client-side timeout
> (60s
> > > > > > > in the UI) enough?
> > > > > > >
> > > > > > > I explored this earlier in #60618 [4] with a self-contained
> > > > > > > implementation. Now that the ExecutorCallback dispatch is
> taking
> > > > shape
> > > > > > > in #61153, building on top of will be in right direction.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Anish
> > > > > > >
> > > > > > > [1] https://github.com/apache/airflow/pull/59643
> > > > > > > [2] https://github.com/apache/airflow/pull/54796
> > > > > > > [3] https://github.com/apache/airflow/pull/61153
> > > > > > > [4] https://github.com/apache/airflow/pull/60618
> > > > > > >
> > > > > > >
> > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > For additional commands, e-mail: [email protected]
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
>

Re: [DISCUSS] Move connection testing to workers

Reply via email to