Re: [DISCUSS] Move connection testing to workers

Anish Giri Fri, 27 Feb 2026 13:10:49 -0800

Thanks Przemysław. On current main, the UI already enforces this the
test button only appears in the list view for already saved
connections, not in the create/edit form. So a connection must be
saved before it can be tested.


In the async design, this is also an architectural requirement workers
fetch the connection by `connection_id` through the secrets backend,
so it must exist before the test can run.

For the edit case Przemysław raised: where a user updates credentials
on a connection that DAGs depend on - Jarek's suggestion of reverting
to the previous values on test failure would address that. This is
something we can accommodate in the current design. Let me know if I'm
missing something.

Anish


On Fri, Feb 27, 2026 at 1:35 PM Jarek Potiuk <[email protected]> wrote:
>
> I would rather say we should revert the save if test connection does not
> work - that should be simpler to implement.
>
> On Fri, Feb 27, 2026 at 6:19 PM Przemysław Mirowski <[email protected]>
> wrote:
>
> > Hi,
> >
> > I just thought that I will add my 2c here.
> >
> > As I agree that saving & testing would simplify the design, I don't think
> > it is a really good approach. In case of editing the connection, which is
> > used by some Dags, we should be able to verify it (by test connection) if
> > e.g. new user credentials will work before saving them. If they will not
> > work and to see it, we will save conn, there is a possibility that some
> > Dags will fail because of that. From my point of view, due to that fact,
> > the test connection should be separate from saving it by overwriting it.
> > Maybe "Test connection" button would save connection in DB temporarily with
> > different ID which would be and "save" would stay as save only would be
> > more safe approach?
> > ________________________________
> > From: Jarek Potiuk <[email protected]>
> > Sent: 27 February 2026 07:48
> > To: [email protected] <[email protected]>
> > Subject: Re: [DISCUSS] Move connection testing to workers
> >
> > Yep all good points there
> >
> > On Fri, Feb 27, 2026, 01:32 Kaxil Naik <[email protected]> wrote:
> >
> > > Sounds good, thanks. Feel free to handle them in separate PRs if needed.
> > >
> > >
> > > On Thu, 26 Feb 2026 at 22:27, Anish Giri <[email protected]>
> > wrote:
> > >
> > > > Thanks Kaxil, Amogh. This is really helpful feedback.
> > > > My earlier POC in #60618 was along these lines; a dedicated
> > > > TestConnection workload type with its own scheduler dispatch, worker
> > > > side timeout enforcement, and a DB backed state machine (PENDING →
> > > > RUNNING → SUCCESS/FAILED). I moved to the ExecutorCallback-based
> > > > approach in #62343 to reduce the scope, but it sounds like the
> > > > dedicated workload was heading towards the right direction. That said,
> > > > #60618 didn't address several of the concerns raised here and it had
> > > > no queue routing (hardcoded first executor), no concurrency budget,
> > > > and no reaper for stuck requests. So this isn't just a revert; it's
> > > > building on that foundation with the operational controls you're
> > > > describing. And based on the discussion, here is the scope I plan to
> > > > implement in the updated PR:
> > > >
> > > > 1. Dedicated ConnectionTest workload: not transported as
> > ExecutorCallback
> > > > 2. Save & Test flow: connection must be saved; worker fetches via
> > > > connection_id through secrets
> > > > 3. Explicit routing: optional queue parameter on the API,
> > > > configuration driven default. Team based routing for multi team setups
> > > > as a natural follow up once that infrastructure lands (per Jarek's
> > > > point that team is a connection attribute and routing should happen
> > > > automatically)
> > > > 4. Dispatch budget: configurable concurrency cap so connection tests
> > > > can't starve task execution
> > > > 5. Strict timeout + stale-work reaper: worker side timeout
> > > > enforcement, scheduler side reaper for stuck RUNNING requests
> > > > 6. Fail fast on unsupported executors: upfront validation rather than
> > > > runtime failure
> > > >
> > > > If there are no objections or anything I've missed, I'll update #62343
> > > > accordingly.
> > > >
> > > > Anish
> > > >
> > > > On Thu, Feb 26, 2026 at 1:52 PM Kaxil Naik <[email protected]>
> > wrote:
> > > > >
> > > > > +1 on moving connection testing off the API server and onto workers
> > for
> > > > the
> > > > > security/isolation reasons already discussed.
> > > > >
> > > > > I also think the queue concern is critical.
> > > > >
> > > > > Queues are a core routing abstraction in Airflow for directing work
> > to
> > > > > specific worker populations (runtime/network/location). For some of
> > our
> > > > > enterprise customer's deployments, some workers can reach certain
> > > > > third-party systems while others cannot, so queue placement directly
> > > > > affects the correctness of a connection test result.
> > > > >
> > > > > My main concern is architectural: connection testing is currently
> > being
> > > > > transported as an ExecutorCallback. Conceptually, connection testing
> > is
> > > > not
> > > > > a callback; it is an ad-hoc workload. That mismatch creates two
> > > concrete
> > > > > risks in the current model:
> > > > >
> > > > > 1) Routing correctness:
> > > > >    Default/fallback queue behavior can run tests in the wrong worker
> > > > > environment
> > > > >
> > > > > 2) Fairness and isolation:
> > > > >    Callbacks are correctness-critical for state progression.
> > Connection
> > > > > tests are user-initiated and can hang; they should not compete in the
> > > > same
> > > > > priority lane.
> > > > >
> > > > > I agree callbacks are critical for correctness (they drive
> > > state/failure
> > > > > progression), so we should preserve their ability to run promptly. At
> > > the
> > > > > same time, connection testing is different in nature (user-initiated
> > > > ad-hoc
> > > > > workload, potentially long-running/hanging) and should not compete in
> > > the
> > > > > same unbounded priority lane. A hanging connection test can occupy a
> > > > worker
> > > > > slot for a long time, so this needs strict timeout/reaper behavior in
> > > > > addition to dispatch controls.
> > > > >
> > > > > So my suggestion is to require a dedicated ConnectionTest workload
> > > before
> > > > > merge, with:
> > > > > - explicit routing semantics (queue/executor/team) -- and the
> > defaults
> > > > can
> > > > > be config driven,
> > > > > - dedicated dispatch budget/concurrency control,
> > > > > - strict timeout + stale-work reaper,
> > > > > - fail-fast behavior on unsupported executors.
> > > > >
> > > > > This keeps the security goal while avoiding routing regressions and
> > > > > scheduler starvation risk, and keeps callback lanes reserved for
> > > callback
> > > > > semantics.
> > > > >
> > > > > Regards,
> > > > > Kaxil
> > > > >
> > > > > On Thu, 26 Feb 2026 at 07:59, Amogh Desai <[email protected]>
> > > wrote:
> > > > >
> > > > > > Good discussion, Anish. +1 on moving connection testing to workers
> > --
> > > > > > running user supplied
> > > > > > driver code on API server is the right thing to move away from.
> > > > > >
> > > > > > Strongly agree with Ephraim and Jarek here. If we require the
> > > > connection to
> > > > > > be saved before testing
> > > > > > (so the worker can fetch it via connection_id through the standard
> > > > secrets
> > > > > > path), the UI button should
> > > > > > make that explicit. "Save & Test" is exactly that and it will avoid
> > > any
> > > > > > surprise of overwriting an existing
> > > > > > connection while *just testing :)*
> > > > > >
> > > > > > Re queue routing: I think an optional queue parameter on the test
> > > > > > connection API should be part of the
> > > > > > initial implementation itself rather than follow up. The default
> > can
> > > be
> > > > > > *default*, which covers the common case.
> > > > > > But there are some other scenarios too:
> > > > > >
> > > > > > * Workers on different queues deployed in different network zones
> > > > meaning
> > > > > > that the test would pass or fail
> > > > > > depending on which queue makes it run.
> > > > > >
> > > > > > * Setups where different queues have different secrets backend, the
> > > > worker
> > > > > > might not even be able to resolve the
> > > > > > connection at all.
> > > > > >
> > > > > > The cost of having an optional queue parameter is kinda low and I
> > > > think it
> > > > > > avoids shipping something that
> > > > > > gives wrong answers in these setups, so if the cost of it is
> > > relatively
> > > > > > low, can we consider doing it?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks & Regards,
> > > > > > Amogh Desai
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 24, 2026 at 10:29 PM Jarek Potiuk <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > > > Maybe changing the button to 'save & test' would suffix.
> > > > > > >
> > > > > > > +10
> > > > > > >
> > > > > > > On Tue, Feb 24, 2026 at 1:17 PM Ephraim Anierobi <
> > > > > > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Cool idea.
> > > > > > > >
> > > > > > > > On saving the connection automatically, I think we should make
> > it
> > > > > > > explicit
> > > > > > > > in the UI that testing the connection will save the connection.
> > > > This
> > > > > > will
> > > > > > > > help the users to know that they are not just testing but also
> > > > creating
> > > > > > > the
> > > > > > > > connection with the entered credentials. Without this being
> > > > explicit, I
> > > > > > > > think users may unknowingly replace a connection while just
> > > trying
> > > > if a
> > > > > > > new
> > > > > > > > connection would work.
> > > > > > > >
> > > > > > > > Maybe changing the button to 'save & test' would suffix.
> > > > > > > >
> > > > > > > > On 2026/02/22 03:50:11 Anish Giri wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I'd like to discuss moving connection testing off the API
> > > server
> > > > and
> > > > > > > > > onto workers. Jarek suggested this direction in a comment on
> > > > #59643
> > > > > > > > > [1], and I think the Callback infrastructure being built for
> > > > running
> > > > > > > > > callbacks on executors is the right foundation for it.
> > > > > > > > >
> > > > > > > > > Since 2.7.0, test_connection has been disabled by default
> > > > (#32052).
> > > > > > > > > Running it on the API server has two problems: the API server
> > > > > > > > > shouldn't be executing user-supplied driver code (Jarek
> > > > described the
> > > > > > > > > ODBC/JDBC risks in detail on #59643), and workers typically
> > > have
> > > > > > > > > network access to external systems that API servers don't, so
> > > > test
> > > > > > > > > results from the API server can be misleading.
> > > > > > > > >
> > > > > > > > > Ramit's generic Callback model (#54796 [2]) and Ferruzzi's
> > > > > > > > > in-progress executor dispatch (#61153 [3]) together give us
> > > most
> > > > of
> > > > > > > > > what's needed. The flow would be:
> > > > > > > > >
> > > > > > > > > 1. UI calls POST /connections/test
> > > > > > > > > 2. API server Fernet-encrypts the connection URI, creates an
> > > > > > > > > ExecutorCallback pointing to the test function, returns an ID
> > > > > > > > > 3. Scheduler dispatch loop (from #61153) picks it up, sends
> > it
> > > > > > > > > to the executor
> > > > > > > > > 4. Worker decrypts the URI, builds a transient Connection,
> > > calls
> > > > > > > > > test_connection(), reports result through the callback path
> > > > > > > > > 5. UI polls GET /connections/test/{id} until it gets a
> > terminal
> > > > > > > > > state
> > > > > > > > >
> > > > > > > > > The connection-testing-specific code would be small: a POST
> > > > endpoint
> > > > > > > > > to queue the test, a GET endpoint to poll for results, and
> > the
> > > > worker
> > > > > > > > > function that decrypts and runs test_connection().
> > > > > > > > >
> > > > > > > > > One thing I noticed: #61153's _enqueue_executor_callbacks
> > > > currently
> > > > > > > > > requires dag_run_id in the callback data dict, and
> > > > > > ExecuteCallback.make
> > > > > > > > > needs a DagRun for bundle info. Connection tests don't have a
> > > > DagRun.
> > > > > > > > > It would be a small change to make that optional. The
> > dispatch
> > > > query
> > > > > > > > > itself is already generic (selects all PENDING
> > > > ExecutorCallbacks). I
> > > > > > > > > can take a look at decoupling that if it would be useful.
> > > > > > > > >
> > > > > > > > > A couple of other open questions:
> > > > > > > > >
> > > > > > > > > 1. The connection test needs to store an encrypted URI,
> > > > conn_type,
> > > > > > and
> > > > > > > > > some timestamps. Is the Callback.data JSON column the right
> > > place
> > > > > > > > > for that, or does it warrant its own small table?
> > > > > > > > >
> > > > > > > > > 2. Stale requests: if a worker crashes mid-test, the record
> > > stays
> > > > > > > > > in a non-terminal state. Should there be a scheduler-side
> > > reaper
> > > > > > > > > similar to zombie task detection, or is client-side timeout
> > > (60s
> > > > > > > > > in the UI) enough?
> > > > > > > > >
> > > > > > > > > I explored this earlier in #60618 [4] with a self-contained
> > > > > > > > > implementation. Now that the ExecutorCallback dispatch is
> > > taking
> > > > > > shape
> > > > > > > > > in #61153, building on top of will be in right direction.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > Anish
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/airflow/pull/59643
> > > > > > > > > [2] https://github.com/apache/airflow/pull/54796
> > > > > > > > > [3] https://github.com/apache/airflow/pull/61153
> > > > > > > > > [4] https://github.com/apache/airflow/pull/60618
> > > > > > > > >
> > > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > > For additional commands, e-mail: [email protected]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > For additional commands, e-mail: [email protected]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > > >
> > >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [DISCUSS] Move connection testing to workers

Reply via email to