Re: [DISCUSS] Move connection testing to workers

Jarek Potiuk Fri, 27 Feb 2026 11:35:09 -0800

I would rather say we should revert the save if test connection does not
work - that should be simpler to implement.


On Fri, Feb 27, 2026 at 6:19 PM Przemysław Mirowski <[email protected]>
wrote:

> Hi,
>
> I just thought that I will add my 2c here.
>
> As I agree that saving & testing would simplify the design, I don't think
> it is a really good approach. In case of editing the connection, which is
> used by some Dags, we should be able to verify it (by test connection) if
> e.g. new user credentials will work before saving them. If they will not
> work and to see it, we will save conn, there is a possibility that some
> Dags will fail because of that. From my point of view, due to that fact,
> the test connection should be separate from saving it by overwriting it.
> Maybe "Test connection" button would save connection in DB temporarily with
> different ID which would be and "save" would stay as save only would be
> more safe approach?
> ________________________________
> From: Jarek Potiuk <[email protected]>
> Sent: 27 February 2026 07:48
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Move connection testing to workers
>
> Yep all good points there
>
> On Fri, Feb 27, 2026, 01:32 Kaxil Naik <[email protected]> wrote:
>
> > Sounds good, thanks. Feel free to handle them in separate PRs if needed.
> >
> >
> > On Thu, 26 Feb 2026 at 22:27, Anish Giri <[email protected]>
> wrote:
> >
> > > Thanks Kaxil, Amogh. This is really helpful feedback.
> > > My earlier POC in #60618 was along these lines; a dedicated
> > > TestConnection workload type with its own scheduler dispatch, worker
> > > side timeout enforcement, and a DB backed state machine (PENDING →
> > > RUNNING → SUCCESS/FAILED). I moved to the ExecutorCallback-based
> > > approach in #62343 to reduce the scope, but it sounds like the
> > > dedicated workload was heading towards the right direction. That said,
> > > #60618 didn't address several of the concerns raised here and it had
> > > no queue routing (hardcoded first executor), no concurrency budget,
> > > and no reaper for stuck requests. So this isn't just a revert; it's
> > > building on that foundation with the operational controls you're
> > > describing. And based on the discussion, here is the scope I plan to
> > > implement in the updated PR:
> > >
> > > 1. Dedicated ConnectionTest workload: not transported as
> ExecutorCallback
> > > 2. Save & Test flow: connection must be saved; worker fetches via
> > > connection_id through secrets
> > > 3. Explicit routing: optional queue parameter on the API,
> > > configuration driven default. Team based routing for multi team setups
> > > as a natural follow up once that infrastructure lands (per Jarek's
> > > point that team is a connection attribute and routing should happen
> > > automatically)
> > > 4. Dispatch budget: configurable concurrency cap so connection tests
> > > can't starve task execution
> > > 5. Strict timeout + stale-work reaper: worker side timeout
> > > enforcement, scheduler side reaper for stuck RUNNING requests
> > > 6. Fail fast on unsupported executors: upfront validation rather than
> > > runtime failure
> > >
> > > If there are no objections or anything I've missed, I'll update #62343
> > > accordingly.
> > >
> > > Anish
> > >
> > > On Thu, Feb 26, 2026 at 1:52 PM Kaxil Naik <[email protected]>
> wrote:
> > > >
> > > > +1 on moving connection testing off the API server and onto workers
> for
> > > the
> > > > security/isolation reasons already discussed.
> > > >
> > > > I also think the queue concern is critical.
> > > >
> > > > Queues are a core routing abstraction in Airflow for directing work
> to
> > > > specific worker populations (runtime/network/location). For some of
> our
> > > > enterprise customer's deployments, some workers can reach certain
> > > > third-party systems while others cannot, so queue placement directly
> > > > affects the correctness of a connection test result.
> > > >
> > > > My main concern is architectural: connection testing is currently
> being
> > > > transported as an ExecutorCallback. Conceptually, connection testing
> is
> > > not
> > > > a callback; it is an ad-hoc workload. That mismatch creates two
> > concrete
> > > > risks in the current model:
> > > >
> > > > 1) Routing correctness:
> > > >    Default/fallback queue behavior can run tests in the wrong worker
> > > > environment
> > > >
> > > > 2) Fairness and isolation:
> > > >    Callbacks are correctness-critical for state progression.
> Connection
> > > > tests are user-initiated and can hang; they should not compete in the
> > > same
> > > > priority lane.
> > > >
> > > > I agree callbacks are critical for correctness (they drive
> > state/failure
> > > > progression), so we should preserve their ability to run promptly. At
> > the
> > > > same time, connection testing is different in nature (user-initiated
> > > ad-hoc
> > > > workload, potentially long-running/hanging) and should not compete in
> > the
> > > > same unbounded priority lane. A hanging connection test can occupy a
> > > worker
> > > > slot for a long time, so this needs strict timeout/reaper behavior in
> > > > addition to dispatch controls.
> > > >
> > > > So my suggestion is to require a dedicated ConnectionTest workload
> > before
> > > > merge, with:
> > > > - explicit routing semantics (queue/executor/team) -- and the
> defaults
> > > can
> > > > be config driven,
> > > > - dedicated dispatch budget/concurrency control,
> > > > - strict timeout + stale-work reaper,
> > > > - fail-fast behavior on unsupported executors.
> > > >
> > > > This keeps the security goal while avoiding routing regressions and
> > > > scheduler starvation risk, and keeps callback lanes reserved for
> > callback
> > > > semantics.
> > > >
> > > > Regards,
> > > > Kaxil
> > > >
> > > > On Thu, 26 Feb 2026 at 07:59, Amogh Desai <[email protected]>
> > wrote:
> > > >
> > > > > Good discussion, Anish. +1 on moving connection testing to workers
> --
> > > > > running user supplied
> > > > > driver code on API server is the right thing to move away from.
> > > > >
> > > > > Strongly agree with Ephraim and Jarek here. If we require the
> > > connection to
> > > > > be saved before testing
> > > > > (so the worker can fetch it via connection_id through the standard
> > > secrets
> > > > > path), the UI button should
> > > > > make that explicit. "Save & Test" is exactly that and it will avoid
> > any
> > > > > surprise of overwriting an existing
> > > > > connection while *just testing :)*
> > > > >
> > > > > Re queue routing: I think an optional queue parameter on the test
> > > > > connection API should be part of the
> > > > > initial implementation itself rather than follow up. The default
> can
> > be
> > > > > *default*, which covers the common case.
> > > > > But there are some other scenarios too:
> > > > >
> > > > > * Workers on different queues deployed in different network zones
> > > meaning
> > > > > that the test would pass or fail
> > > > > depending on which queue makes it run.
> > > > >
> > > > > * Setups where different queues have different secrets backend, the
> > > worker
> > > > > might not even be able to resolve the
> > > > > connection at all.
> > > > >
> > > > > The cost of having an optional queue parameter is kinda low and I
> > > think it
> > > > > avoids shipping something that
> > > > > gives wrong answers in these setups, so if the cost of it is
> > relatively
> > > > > low, can we consider doing it?
> > > > >
> > > > >
> > > > >
> > > > > Thanks & Regards,
> > > > > Amogh Desai
> > > > >
> > > > >
> > > > > On Tue, Feb 24, 2026 at 10:29 PM Jarek Potiuk <[email protected]>
> > > wrote:
> > > > >
> > > > > > > Maybe changing the button to 'save & test' would suffix.
> > > > > >
> > > > > > +10
> > > > > >
> > > > > > On Tue, Feb 24, 2026 at 1:17 PM Ephraim Anierobi <
> > > > > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Cool idea.
> > > > > > >
> > > > > > > On saving the connection automatically, I think we should make
> it
> > > > > > explicit
> > > > > > > in the UI that testing the connection will save the connection.
> > > This
> > > > > will
> > > > > > > help the users to know that they are not just testing but also
> > > creating
> > > > > > the
> > > > > > > connection with the entered credentials. Without this being
> > > explicit, I
> > > > > > > think users may unknowingly replace a connection while just
> > trying
> > > if a
> > > > > > new
> > > > > > > connection would work.
> > > > > > >
> > > > > > > Maybe changing the button to 'save & test' would suffix.
> > > > > > >
> > > > > > > On 2026/02/22 03:50:11 Anish Giri wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I'd like to discuss moving connection testing off the API
> > server
> > > and
> > > > > > > > onto workers. Jarek suggested this direction in a comment on
> > > #59643
> > > > > > > > [1], and I think the Callback infrastructure being built for
> > > running
> > > > > > > > callbacks on executors is the right foundation for it.
> > > > > > > >
> > > > > > > > Since 2.7.0, test_connection has been disabled by default
> > > (#32052).
> > > > > > > > Running it on the API server has two problems: the API server
> > > > > > > > shouldn't be executing user-supplied driver code (Jarek
> > > described the
> > > > > > > > ODBC/JDBC risks in detail on #59643), and workers typically
> > have
> > > > > > > > network access to external systems that API servers don't, so
> > > test
> > > > > > > > results from the API server can be misleading.
> > > > > > > >
> > > > > > > > Ramit's generic Callback model (#54796 [2]) and Ferruzzi's
> > > > > > > > in-progress executor dispatch (#61153 [3]) together give us
> > most
> > > of
> > > > > > > > what's needed. The flow would be:
> > > > > > > >
> > > > > > > > 1. UI calls POST /connections/test
> > > > > > > > 2. API server Fernet-encrypts the connection URI, creates an
> > > > > > > > ExecutorCallback pointing to the test function, returns an ID
> > > > > > > > 3. Scheduler dispatch loop (from #61153) picks it up, sends
> it
> > > > > > > > to the executor
> > > > > > > > 4. Worker decrypts the URI, builds a transient Connection,
> > calls
> > > > > > > > test_connection(), reports result through the callback path
> > > > > > > > 5. UI polls GET /connections/test/{id} until it gets a
> terminal
> > > > > > > > state
> > > > > > > >
> > > > > > > > The connection-testing-specific code would be small: a POST
> > > endpoint
> > > > > > > > to queue the test, a GET endpoint to poll for results, and
> the
> > > worker
> > > > > > > > function that decrypts and runs test_connection().
> > > > > > > >
> > > > > > > > One thing I noticed: #61153's _enqueue_executor_callbacks
> > > currently
> > > > > > > > requires dag_run_id in the callback data dict, and
> > > > > ExecuteCallback.make
> > > > > > > > needs a DagRun for bundle info. Connection tests don't have a
> > > DagRun.
> > > > > > > > It would be a small change to make that optional. The
> dispatch
> > > query
> > > > > > > > itself is already generic (selects all PENDING
> > > ExecutorCallbacks). I
> > > > > > > > can take a look at decoupling that if it would be useful.
> > > > > > > >
> > > > > > > > A couple of other open questions:
> > > > > > > >
> > > > > > > > 1. The connection test needs to store an encrypted URI,
> > > conn_type,
> > > > > and
> > > > > > > > some timestamps. Is the Callback.data JSON column the right
> > place
> > > > > > > > for that, or does it warrant its own small table?
> > > > > > > >
> > > > > > > > 2. Stale requests: if a worker crashes mid-test, the record
> > stays
> > > > > > > > in a non-terminal state. Should there be a scheduler-side
> > reaper
> > > > > > > > similar to zombie task detection, or is client-side timeout
> > (60s
> > > > > > > > in the UI) enough?
> > > > > > > >
> > > > > > > > I explored this earlier in #60618 [4] with a self-contained
> > > > > > > > implementation. Now that the ExecutorCallback dispatch is
> > taking
> > > > > shape
> > > > > > > > in #61153, building on top of will be in right direction.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > Anish
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/airflow/pull/59643
> > > > > > > > [2] https://github.com/apache/airflow/pull/54796
> > > > > > > > [3] https://github.com/apache/airflow/pull/61153
> > > > > > > > [4] https://github.com/apache/airflow/pull/60618
> > > > > > > >
> > > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > For additional commands, e-mail: [email protected]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > For additional commands, e-mail: [email protected]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
> >
>

Re: [DISCUSS] Move connection testing to workers

Reply via email to