ramitkataria commented on PR #66608:
URL: https://github.com/apache/airflow/pull/66608#issuecomment-4755580502

   Overall I'm good with the direction. Clear improvement over today's 
workaround. A couple of things before merge: a scope preference and one 
regression. Everything else I'm fine merging as-is, provided we open quick 
follow-ups.
   
     ### Scope: I prefer smaller PRs
   
     These aren't "fetch callback context at runtime", and splitting them out 
would make each easier to review:
   
     1. **Dynamic `VariableInterval` support**: a self-contained feature 
spanning `dag.py` resolution, the `0117` migration, the UI datamodel + route 
(`interval` -> `float | None` + validator, dropping `interval` as a sort key), 
the regenerated `openapi-gen` TS + 4 `.tsx` + `i18n/dag.json`, and its 
`test_deadlines.py` coverage. The UI code is fine but it's all downstream of 
that one change, which is why it reads as a separate feature: context-fetching 
has nothing to do with how an interval renders. I'd make it its own PR (and fix 
the regression below there).
     2. **`base_executor.py` `team_name` metric tagging**: multi-team 
observability, seems unrelated?
     3. **Scheduler/executor resilience hardening** (`begin_nested()` 
isolation, the `FOR UPDATE SKIP LOCKED` rewrite, the `prune_deadlines` 
`~missed` guard): more coupled to the callback machinery, so I won't push to 
split it, just flagging it as a distinct concern.
   
     ### Regression of existing behavior
   
     - **`VariableInterval` resolution now reads only the metadata `variable` 
table** (`dag.py` -> `session.scalar(select(VariableModel))` instead of 
`Variable.get()`), bypassing secrets backends and `AIRFLOW_VAR_*` env vars. 
Those now return `None` -> `ValueError` -> swallowed by the new `try/except`, 
so the deadline is **silently dropped**. The session-scoped read is the right 
fix for `prohibit_commit`, but it must still go through secrets/env resolution 
(or fail loudly). Untested too: the success test mocks `Variable.get`, which 
the new code no longer calls, so it passes while the deadline is skipped. 
Please add a test that inserts a real `Variable` row and asserts a `Deadline` 
is created. (Belongs in the `VariableInterval` PR if you split it; else fix 
here.)
   
     ### Fine to merge now (fast follow-ups please)
   
     - **Celery still terminally FAILs a transient context-fetch blip**: the 
retry is wired only in `LocalExecutor`, so the alert is silently lost on the 
most common production executor. Close the gap or scope+document it.
     - **PENDING-requeue has no cap/backoff**: a DagRun that exists but keeps 
failing to fetch re-enqueues every scheduler loop forever. Needs a bounded 
retry / give-up.
     - Add the index backing the new `FOR UPDATE SKIP LOCKED` callback query 
(unindexed `type`/`state`/`priority_weight`/`created_at`).
     - Add a newsfragment.
   
     ### Direction I'm reviewing toward (ties into #67919)
   
     End state: a callback gets the **same `Context` type** as a task, i.e. 
every field not tied to a task instance (`dag_run`, `run_id`, …, plus lazy 
`var`/`conn`/`macros`), with task-only fields as documented omissions, fetched 
through the Execution API on both paths. As follow-ups:
   
     - Type `build_context_from_dag_run`'s return as `Context`, add `deadline` 
as a `NotRequired` member of `Context` (+ `KNOWN_CONTEXT_KEYS`), and either 
wire `var`/`conn`/`macros` in (the read-only comms channels already exist from 
#65269) or document that callback templates are limited to Dag-run-level fields.
     - **The triggerer path fetches the DagRun with a direct 
`session.scalar(select(DagRun))`, which collides with #67919** (cc @kaxil). 
That PR moves workload-building server-side into `triggerer_workloads.py`, 
removes the triggerer's DB connection, and its new builder has no callback 
branch. Suggestion unless @kaxil prefers otherwise: **land this PR first** and 
re-home the triggerer callback path onto the server-side (API-based) builder 
during the #67919 rebase. It's being rewritten there anyway, so that's the 
right home rather than building on `_create_workload` here only to delete it. 
Keep the triggerer-side changes here minimal so that rebase stays cheap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to