Yep all good points there On Fri, Feb 27, 2026, 01:32 Kaxil Naik <[email protected]> wrote:
> Sounds good, thanks. Feel free to handle them in separate PRs if needed. > > > On Thu, 26 Feb 2026 at 22:27, Anish Giri <[email protected]> wrote: > > > Thanks Kaxil, Amogh. This is really helpful feedback. > > My earlier POC in #60618 was along these lines; a dedicated > > TestConnection workload type with its own scheduler dispatch, worker > > side timeout enforcement, and a DB backed state machine (PENDING → > > RUNNING → SUCCESS/FAILED). I moved to the ExecutorCallback-based > > approach in #62343 to reduce the scope, but it sounds like the > > dedicated workload was heading towards the right direction. That said, > > #60618 didn't address several of the concerns raised here and it had > > no queue routing (hardcoded first executor), no concurrency budget, > > and no reaper for stuck requests. So this isn't just a revert; it's > > building on that foundation with the operational controls you're > > describing. And based on the discussion, here is the scope I plan to > > implement in the updated PR: > > > > 1. Dedicated ConnectionTest workload: not transported as ExecutorCallback > > 2. Save & Test flow: connection must be saved; worker fetches via > > connection_id through secrets > > 3. Explicit routing: optional queue parameter on the API, > > configuration driven default. Team based routing for multi team setups > > as a natural follow up once that infrastructure lands (per Jarek's > > point that team is a connection attribute and routing should happen > > automatically) > > 4. Dispatch budget: configurable concurrency cap so connection tests > > can't starve task execution > > 5. Strict timeout + stale-work reaper: worker side timeout > > enforcement, scheduler side reaper for stuck RUNNING requests > > 6. Fail fast on unsupported executors: upfront validation rather than > > runtime failure > > > > If there are no objections or anything I've missed, I'll update #62343 > > accordingly. > > > > Anish > > > > On Thu, Feb 26, 2026 at 1:52 PM Kaxil Naik <[email protected]> wrote: > > > > > > +1 on moving connection testing off the API server and onto workers for > > the > > > security/isolation reasons already discussed. > > > > > > I also think the queue concern is critical. > > > > > > Queues are a core routing abstraction in Airflow for directing work to > > > specific worker populations (runtime/network/location). For some of our > > > enterprise customer's deployments, some workers can reach certain > > > third-party systems while others cannot, so queue placement directly > > > affects the correctness of a connection test result. > > > > > > My main concern is architectural: connection testing is currently being > > > transported as an ExecutorCallback. Conceptually, connection testing is > > not > > > a callback; it is an ad-hoc workload. That mismatch creates two > concrete > > > risks in the current model: > > > > > > 1) Routing correctness: > > > Default/fallback queue behavior can run tests in the wrong worker > > > environment > > > > > > 2) Fairness and isolation: > > > Callbacks are correctness-critical for state progression. Connection > > > tests are user-initiated and can hang; they should not compete in the > > same > > > priority lane. > > > > > > I agree callbacks are critical for correctness (they drive > state/failure > > > progression), so we should preserve their ability to run promptly. At > the > > > same time, connection testing is different in nature (user-initiated > > ad-hoc > > > workload, potentially long-running/hanging) and should not compete in > the > > > same unbounded priority lane. A hanging connection test can occupy a > > worker > > > slot for a long time, so this needs strict timeout/reaper behavior in > > > addition to dispatch controls. > > > > > > So my suggestion is to require a dedicated ConnectionTest workload > before > > > merge, with: > > > - explicit routing semantics (queue/executor/team) -- and the defaults > > can > > > be config driven, > > > - dedicated dispatch budget/concurrency control, > > > - strict timeout + stale-work reaper, > > > - fail-fast behavior on unsupported executors. > > > > > > This keeps the security goal while avoiding routing regressions and > > > scheduler starvation risk, and keeps callback lanes reserved for > callback > > > semantics. > > > > > > Regards, > > > Kaxil > > > > > > On Thu, 26 Feb 2026 at 07:59, Amogh Desai <[email protected]> > wrote: > > > > > > > Good discussion, Anish. +1 on moving connection testing to workers -- > > > > running user supplied > > > > driver code on API server is the right thing to move away from. > > > > > > > > Strongly agree with Ephraim and Jarek here. If we require the > > connection to > > > > be saved before testing > > > > (so the worker can fetch it via connection_id through the standard > > secrets > > > > path), the UI button should > > > > make that explicit. "Save & Test" is exactly that and it will avoid > any > > > > surprise of overwriting an existing > > > > connection while *just testing :)* > > > > > > > > Re queue routing: I think an optional queue parameter on the test > > > > connection API should be part of the > > > > initial implementation itself rather than follow up. The default can > be > > > > *default*, which covers the common case. > > > > But there are some other scenarios too: > > > > > > > > * Workers on different queues deployed in different network zones > > meaning > > > > that the test would pass or fail > > > > depending on which queue makes it run. > > > > > > > > * Setups where different queues have different secrets backend, the > > worker > > > > might not even be able to resolve the > > > > connection at all. > > > > > > > > The cost of having an optional queue parameter is kinda low and I > > think it > > > > avoids shipping something that > > > > gives wrong answers in these setups, so if the cost of it is > relatively > > > > low, can we consider doing it? > > > > > > > > > > > > > > > > Thanks & Regards, > > > > Amogh Desai > > > > > > > > > > > > On Tue, Feb 24, 2026 at 10:29 PM Jarek Potiuk <[email protected]> > > wrote: > > > > > > > > > > Maybe changing the button to 'save & test' would suffix. > > > > > > > > > > +10 > > > > > > > > > > On Tue, Feb 24, 2026 at 1:17 PM Ephraim Anierobi < > > > > > [email protected]> > > > > > wrote: > > > > > > > > > > > Cool idea. > > > > > > > > > > > > On saving the connection automatically, I think we should make it > > > > > explicit > > > > > > in the UI that testing the connection will save the connection. > > This > > > > will > > > > > > help the users to know that they are not just testing but also > > creating > > > > > the > > > > > > connection with the entered credentials. Without this being > > explicit, I > > > > > > think users may unknowingly replace a connection while just > trying > > if a > > > > > new > > > > > > connection would work. > > > > > > > > > > > > Maybe changing the button to 'save & test' would suffix. > > > > > > > > > > > > On 2026/02/22 03:50:11 Anish Giri wrote: > > > > > > > Hi all, > > > > > > > > > > > > > > I'd like to discuss moving connection testing off the API > server > > and > > > > > > > onto workers. Jarek suggested this direction in a comment on > > #59643 > > > > > > > [1], and I think the Callback infrastructure being built for > > running > > > > > > > callbacks on executors is the right foundation for it. > > > > > > > > > > > > > > Since 2.7.0, test_connection has been disabled by default > > (#32052). > > > > > > > Running it on the API server has two problems: the API server > > > > > > > shouldn't be executing user-supplied driver code (Jarek > > described the > > > > > > > ODBC/JDBC risks in detail on #59643), and workers typically > have > > > > > > > network access to external systems that API servers don't, so > > test > > > > > > > results from the API server can be misleading. > > > > > > > > > > > > > > Ramit's generic Callback model (#54796 [2]) and Ferruzzi's > > > > > > > in-progress executor dispatch (#61153 [3]) together give us > most > > of > > > > > > > what's needed. The flow would be: > > > > > > > > > > > > > > 1. UI calls POST /connections/test > > > > > > > 2. API server Fernet-encrypts the connection URI, creates an > > > > > > > ExecutorCallback pointing to the test function, returns an ID > > > > > > > 3. Scheduler dispatch loop (from #61153) picks it up, sends it > > > > > > > to the executor > > > > > > > 4. Worker decrypts the URI, builds a transient Connection, > calls > > > > > > > test_connection(), reports result through the callback path > > > > > > > 5. UI polls GET /connections/test/{id} until it gets a terminal > > > > > > > state > > > > > > > > > > > > > > The connection-testing-specific code would be small: a POST > > endpoint > > > > > > > to queue the test, a GET endpoint to poll for results, and the > > worker > > > > > > > function that decrypts and runs test_connection(). > > > > > > > > > > > > > > One thing I noticed: #61153's _enqueue_executor_callbacks > > currently > > > > > > > requires dag_run_id in the callback data dict, and > > > > ExecuteCallback.make > > > > > > > needs a DagRun for bundle info. Connection tests don't have a > > DagRun. > > > > > > > It would be a small change to make that optional. The dispatch > > query > > > > > > > itself is already generic (selects all PENDING > > ExecutorCallbacks). I > > > > > > > can take a look at decoupling that if it would be useful. > > > > > > > > > > > > > > A couple of other open questions: > > > > > > > > > > > > > > 1. The connection test needs to store an encrypted URI, > > conn_type, > > > > and > > > > > > > some timestamps. Is the Callback.data JSON column the right > place > > > > > > > for that, or does it warrant its own small table? > > > > > > > > > > > > > > 2. Stale requests: if a worker crashes mid-test, the record > stays > > > > > > > in a non-terminal state. Should there be a scheduler-side > reaper > > > > > > > similar to zombie task detection, or is client-side timeout > (60s > > > > > > > in the UI) enough? > > > > > > > > > > > > > > I explored this earlier in #60618 [4] with a self-contained > > > > > > > implementation. Now that the ExecutorCallback dispatch is > taking > > > > shape > > > > > > > in #61153, building on top of will be in right direction. > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > Anish > > > > > > > > > > > > > > [1] https://github.com/apache/airflow/pull/59643 > > > > > > > [2] https://github.com/apache/airflow/pull/54796 > > > > > > > [3] https://github.com/apache/airflow/pull/61153 > > > > > > > [4] https://github.com/apache/airflow/pull/60618 > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > > To unsubscribe, e-mail: [email protected] > > > > > > > For additional commands, e-mail: [email protected] > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > To unsubscribe, e-mail: [email protected] > > > > > > For additional commands, e-mail: [email protected] > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > >
