Thanks Stefan, and thanks Jens for the follow-up question. On the AIP-96/97 convergence question you both raised:
I looked at AIP-97 carefully. AIP-105 and AIP-97 cover different failure domains and don't block each other.

AIP-105's RetryPolicy runs in the worker process, after the exception is caught in the task's try/except. It handles failures that manifest as Python exceptions: rate limits, auth errors, connection timeouts, transient DB errors. The worker is alive, the exception object is available, and the policy can inspect it.

To Jens's question directly: when the worker dies (segfault, pod eviction, OOM kill, heartbeat loss), AIP-105's policy never runs. The worker process is gone. In that case, the existing scheduler-based retry kicks in, exactly as it does today. That's AIP-97's territory: the executor or scheduler detects the failure externally and manages a separate infrastructure retry budget.

So the split is:

- Application failures (rate limits, auth errors, data validation) raise Python exceptions in user code -- AIP-105 handles these.
- Infrastructure failures (pod eviction, OOM kill, worker heartbeat loss) kill the worker process before any exception is caught -- AIP-97 handles these, and since it touches the scheduler and executor it is more involved.

They're parallel tracks with separate execution paths.

Thanks,
Kaxil

On Tue, 21 Apr 2026 at 21:22, Jens Scheffler <[email protected]> wrote:

> Good point that Stefan made - I also had commented on the relation to
> AIP-97, which I would love to see AIP-105 converge with.
>
> In this light, what is actually the intent of the retry policy if the
> worker "dies" in a segfault or loses its heartbeat? Does the standard /
> existing scheduler-based retry kick in then?
>
> Jens
>
> On 21.04.26 02:19, Stefan Wang wrote:
> > Thanks Kaxil, huge +1
> >
> > This feels like a meaningful step forward.
> >
> > Giving users a way to express retry intent and putting the policy on
> > the operator is something we've needed for a while.
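[Editor's note: the worker-side vs scheduler-side split Kaxil describes above can be sketched in a few lines. All names and signatures below are hypothetical illustrations, not code from the AIP-105 PRs.]

```python
from datetime import timedelta


def run_task_attempt(task_fn, policy, try_number, max_retries):
    # Hypothetical worker-side loop. If the worker is OOM-killed or the pod
    # is evicted, nothing in this function runs: the scheduler detects the
    # lost heartbeat externally and applies its own retry (AIP-97's domain).
    try:
        return ("success", task_fn())
    except Exception as exc:
        # The worker is alive and holds the live exception object, so a
        # pluggable policy can inspect it here -- AIP-105's domain.
        action, delay = policy(exc)
        if action == "FAIL":
            return ("failed", None)          # skip remaining retries
        if action == "RETRY" and try_number <= max_retries:
            return ("up_for_retry", delay)   # reschedule with a custom delay
        return ("up_for_retry", None)        # DEFAULT: standard retries=N path


def sample_policy(exc):
    # Illustrative policy: back off on timeouts, defer everything else.
    if isinstance(exc, TimeoutError):
        return ("RETRY", timedelta(seconds=30))
    return ("DEFAULT", None)
```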
> > The current options aren't great: wrap everything in try/except and
> > raise AirflowFailException, or live with retries=3 as a blunt
> > instrument. Both are compromises.
> >
> > A few things that stand out in the design:
> >
> > 1. Evaluating on the worker is the right call. Exceptions don't
> > serialize cleanly across process boundaries, and keeping the decision
> > close to where the exception actually happens avoids a lot of
> > indirection. The scheduler-side version would be simpler to ship but
> > harder to use.
> >
> > 2. The flat rule list is easier to reason about and validate at parse
> > time than a nested structure would be. Elad's suggestion to let one
> > rule match multiple exception types would tighten the common case
> > without losing that.
> >
> > A couple of thoughts that came up while reading:
> >
> > 1. On Paweł's testing point: if policy.evaluate() is just a method you
> > can call with a synthetic exception, DAG authors can cover a lot of
> > ground in unit tests. Not the same as validating in production, but it
> > catches a decent amount before deploy.
> >
> > 2. On retry budgets more broadly: retries=N today can get consumed by
> > worker evictions or heartbeat losses before any retry policy ever
> > runs. Pluggable policies will feel sharper once the user-visible
> > budget actually reflects user-domain failures. I also have two drafts
> > touching this area, AIP-96 (Resumable Operators) and AIP-97 (Execution
> > Context + separate infra retry budget), and will post updates on both
> > soon. Open to converging where it makes sense.
> >
> > For what it's worth, we've been running two related pieces in
> > production at LinkedIn. One is a mixin that preserves external jobs
> > (Spark, Flink, and similar) when the worker gets disrupted instead of
> > cancelling them.
> > The other is a separate infrastructure retry budget set generously
> > enough that infrastructure events don't eat into user-visible retries.
> > I can share anonymized failure-category data from both if it would
> > help ground the default rule library.
> >
> > Looking forward to v2.
> >
> > — Stefan
> >
> >> On Apr 20, 2026, at 1:50 PM, Przemysław Mirowski <[email protected]> wrote:
> >>
> >> Great idea! Thanks for proposing it. It will make proper
> >> exception-retry handling much easier than before and will open a new
> >> door for more extensibility too.
> >>
> >> +1 also to the questions/concerns Elad mentioned. I'm not sure about
> >> the changes to priority weight (maybe part of AIP-100), but point 2,
> >> on not having full control over the exception raised, is something we
> >> should consider -- looking at the Airflow ecosystem, all of the
> >> providers pull in different libraries.
> >>
> >> One additional comment: as the retry policies will only run on
> >> workers (which is pretty nice from e.g. a security point of view), I
> >> didn't see in the AIP or the PR a way to validate that a configured
> >> retry policy will work before the moment it is actually needed. That
> >> can make setting up retry policies harder and testing them
> >> cumbersome. A nice way (from the Dag author's perspective) to test
> >> whether a defined retry policy will behave as intended would make Dag
> >> authors' lives much easier and defining these rules simpler. (Loosely
> >> related: testing Airflow Connections, and the work on moving "Test
> >> Connection" to workers.) Of course, LLM-based retry policies are
> >> rather out of scope here, but testing the more deterministic
> >> behaviours should be much easier.
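[Editor's note: Stefan's suggestion of unit-testing policies with synthetic exceptions, which also answers Przemysław's validation concern above, can be made concrete with a small sketch. Everything here is a hypothetical stand-in -- the rule table, the evaluate() helper, and the HTTPError stub are illustrative, not the AIP-105 API.]

```python
from datetime import timedelta

# Hypothetical flat rule table: exception class name -> (action, delay).
RULES = {
    "HTTPError": ("RETRY", timedelta(minutes=5)),
    "RefreshError": ("FAIL", None),
}


def evaluate(exc: BaseException):
    """Return (action, delay) for a caught exception, walking the MRO so
    subclasses of a listed exception also match; DEFAULT if nothing matches."""
    for cls in type(exc).__mro__:
        if cls.__name__ in RULES:
            return RULES[cls.__name__]
    return ("DEFAULT", None)


# Unit-test style checks with synthetic exceptions -- no worker, scheduler,
# or live API involved, so these can run in ordinary CI before deploy.
class HTTPError(Exception):
    """Stub standing in for requests.exceptions.HTTPError."""


assert evaluate(HTTPError("429 Too Many Requests")) == ("RETRY", timedelta(minutes=5))
assert evaluate(ValueError("boom")) == ("DEFAULT", None)
```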
> >>
> >> ________________________________
> >> From: Vincent Beck <[email protected]>
> >> Sent: 20 April 2026 15:17
> >> To: [email protected] <[email protected]>
> >> Subject: Re: [DISCUSS] AIP-105: Pluggable Retry Policies
> >>
> >> Makes a lot of sense to me!
> >>
> >> On 2026/04/19 13:56:56 Elad Kalif wrote:
> >>> Great idea! Love it!
> >>>
> >>> I have some questions / comments:
> >>>
> >>> 1. The current interface suggests rules that contain a RetryRule
> >>> object, but I wonder if we should change exception to exceptions and
> >>> accept a list:
> >>>
> >>> rules=[
> >>>     RetryRule(
> >>>         exceptions=["requests.exceptions.HTTPError",
> >>>                     "google.auth.exceptions.RefreshError"],
> >>>         ...,
> >>>     )
> >>> ]
> >>>
> >>> I'm thinking of a case where several exceptions need the same
> >>> behaviour and the user may not wish to write separate reasoning for
> >>> each.
> >>>
> >>> 2. Does it make sense to extend the interface to XCom values? I'm
> >>> thinking of a case where DAG authors don't have full control over
> >>> the exception raised, or an upstream library changes the exception,
> >>> which breaks the retry logic. Maybe we should also offer the option
> >>> to decide retries based on the previous attempt's XCom value?
> >>>
> >>> 3. Maybe something for the longer run but still worth discussing:
> >>> one of the main motivations for custom weight rules
> >>> https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule
> >>> was to set priority based on try number. I wonder if we may want to
> >>> combine that with the retry rule somehow. For retries, I can argue
> >>> that the weight of the task is a property of the retry instructions,
> >>> and it may well be that the weight should change depending on the
> >>> exception.
> >>>
> >>> On Sun, Apr 19, 2026 at 6:30 AM Shahar Epstein <[email protected]> wrote:
> >>>
> >>>> Great idea! I liked both the deterministic approach and the
> >>>> AI-integrated one.
> >>>>
> >>>> Shahar
> >>>>
> >>>> On Sat, Apr 18, 2026 at 3:02 AM Kaxil Naik <[email protected]> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Continuing the push to make Airflow AI-native, I have put together
> >>>>> AIP-105: Pluggable Retry Policies.
> >>>>>
> >>>>> Wiki:
> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> >>>>> PR (core): https://github.com/apache/airflow/pull/65450
> >>>>> PR (LLM-powered, common-ai provider):
> >>>>> https://github.com/apache/airflow/pull/65451
> >>>>>
> >>>>> The problem is straightforward: Airflow retries every failure the
> >>>>> same way. An expired API key gets retried 3 times over 15 minutes.
> >>>>> A rate-limited API gets retried immediately, hitting the same 429.
> >>>>> Users who want smarter retries today have to wrap every task in
> >>>>> try/except and raise AirflowFailException manually, mixing retry
> >>>>> logic into business logic.
> >>>>>
> >>>>> This AIP adds a retry_policy parameter to BaseOperator. The policy
> >>>>> evaluates the actual exception at failure time and returns RETRY
> >>>>> (with a custom delay), FAIL (skip remaining retries), or DEFAULT
> >>>>> (standard behaviour). It runs in the worker process, not the
> >>>>> scheduler.
> >>>>>
> >>>>> Declarative example:
> >>>>>
> >>>>> ```python
> >>>>> @task(
> >>>>>     retries=5,
> >>>>>     retry_policy=ExceptionRetryPolicy(
> >>>>>         rules=[
> >>>>>             RetryRule(
> >>>>>                 exception="requests.exceptions.HTTPError",
> >>>>>                 action=RetryAction.RETRY,
> >>>>>                 retry_delay=timedelta(minutes=5),
> >>>>>             ),
> >>>>>             RetryRule(
> >>>>>                 exception="google.auth.exceptions.RefreshError",
> >>>>>                 action=RetryAction.FAIL,
> >>>>>             ),
> >>>>>         ]
> >>>>>     ),
> >>>>> )
> >>>>> def call_api():
> >>>>>     ...
> >>>>> ```
> >>>>>
> >>>>> LLM-powered example -- uses any pydantic-ai provider (OpenAI,
> >>>>> Anthropic, Bedrock, Ollama):
> >>>>>
> >>>>> @task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
> >>>>> def call_flaky_api(): ...
> >>>>>
> >>>>> The LLM version classifies errors into categories (auth,
> >>>>> rate_limit, network, data, transient, permanent) using structured
> >>>>> output with a 30-second timeout and declarative fallback rules for
> >>>>> when the LLM itself is down.
> >>>>>
> >>>>> I have attached demo videos and screenshots to both PRs showing
> >>>>> both policies running end-to-end in Airflow -- including the LLM
> >>>>> correctly classifying 4 different error types via Claude Haiku.
> >>>>>
> >>>>> Full design, done criteria, and implementation details are in the
> >>>>> wiki page above.
> >>>>>
> >>>>> Feedback welcome.
> >>>>>
> >>>>> Thanks,
> >>>>> Kaxil
> >>>>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
