Thanks Stefan, and thanks Jens for the follow-up question.

On the AIP-96/97 convergence question you both raised:

I looked at AIP-97 carefully. AIP-105 and AIP-97 cover different failure
domains and don't block each other.

AIP-105's RetryPolicy runs in the worker process, after the exception is
caught in the task's try/except. It handles failures that manifest as
Python exceptions: rate limits, auth errors, connection timeouts,
transient DB errors. The worker is alive, the exception object is
available, and the policy can inspect it.
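As a rough stand-alone illustration of that worker-side flow (the names RetryRule, RetryAction, ExceptionRetryPolicy, and evaluate() are taken from the examples further down this thread; the real AIP-105 interface may differ), the decision is made right next to the live exception object:

```python
# Minimal sketch, not Airflow code: the policy matches the caught
# exception's fully qualified type name against declarative rules.
from dataclasses import dataclass
from enum import Enum, auto


class RetryAction(Enum):
    RETRY = auto()
    FAIL = auto()
    DEFAULT = auto()


@dataclass
class RetryRule:
    exception: str  # fully qualified name, e.g. "requests.exceptions.HTTPError"
    action: RetryAction


class ExceptionRetryPolicy:
    def __init__(self, rules):
        self.rules = rules

    def evaluate(self, exc: BaseException) -> RetryAction:
        # The worker is alive and holds the live exception object,
        # so the policy can inspect it directly.
        exc_name = f"{type(exc).__module__}.{type(exc).__qualname__}"
        for rule in self.rules:
            if exc_name == rule.exception:
                return rule.action
        return RetryAction.DEFAULT


# Worker-side: the exception is caught in the task's try/except,
# then handed to the policy while the process is still running.
policy = ExceptionRetryPolicy([RetryRule("builtins.TimeoutError", RetryAction.RETRY)])
try:
    raise TimeoutError("transient network blip")
except Exception as exc:
    decision = policy.evaluate(exc)
print(decision)  # RetryAction.RETRY
```

Because evaluation is just a method call over an exception object, the same code path is also what a unit test would exercise with a synthetic exception.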

To Jens's question directly: when the worker dies (segfault, pod eviction,
OOM kill, heartbeat loss), AIP-105's policy never runs. The worker process
is gone. In that case, the existing scheduler-based retry kicks
in -- exactly as it does today. That's AIP-97's territory: the executor or
scheduler detects the failure externally and manages a separate
infrastructure retry budget.

So the split is:
- Application failures (rate limits, auth errors, data validation) raise
Python exceptions in user code -- AIP-105 handles these
- Infrastructure failures (pod eviction, OOM kill, worker heartbeat loss)
kill the worker process before any exception is caught -- AIP-97 handles
these, and since it touches the scheduler and executor, it is the more
involved track.

They're parallel tracks with separate execution paths.

Thanks,
Kaxil

On Tue, 21 Apr 2026 at 21:22, Jens Scheffler <[email protected]> wrote:

> Good point from Stefan - I had also commented on the relation to AIP-97,
> which I would love to see AIP-105 converge with.
>
> In this light, what would the intent of the retry policy actually be if
> the worker "dies" in a segfault or loses its heartbeat? Does the standard /
> existing scheduler-based retry kick in then?
>
> Jens
>
> On 21.04.26 02:19, Stefan Wang wrote:
> > Thanks Kaxil,
> > huge +1
> >
> > This feels like a meaningful step forward.
> >
> > Giving users a way to express retry intent and putting the policy on the
> > operator is something we've needed for a while. The current options
> > aren't great: wrap everything in try/except and raise
> > AirflowFailException, or live with retries=3 as a blunt instrument.
> > Both are compromises.
> >
> > A few things that stand out in the design:
> >
> > 1. I think evaluating on the worker is the right call. Exceptions don't
> > serialize cleanly across process boundaries, and keeping the decision
> > close to where the exception actually happens avoids a lot of
> > indirection. The scheduler-side version would be simpler to ship but
> > harder to use.
> >
> > 2. The flat rule list is easier to reason about and validate at parse
> > time than a nested structure would be. Elad's suggestion to let one rule
> > match multiple exception types would tighten the common case without
> > losing that.
> >
> > A couple of thoughts that came up while reading:
> >
> > 1. On Przemysław's testing point: if policy.evaluate() is just a method
> > you can call with a synthetic exception, DAG authors can cover a lot of
> > ground in unit tests. Not the same as validating in production, but it
> > catches a decent amount before deploy.
> >
> > 2. On retry budgets (a separate infra retry budget) more broadly:
> > retries=N today can get consumed by worker evictions or heartbeat
> > losses before any retry policy ever runs. Pluggable policies will feel
> > sharper once the user-visible budget actually reflects user-domain
> > failures. I also have two drafts touching this area, AIP-96 (Resumable
> > Operators) and AIP-97 (Execution Context + separate infra retry
> > budget), and will post updates on both soon. Open to converging where
> > it makes sense.
> >
> > For what it's worth, we've been running two related pieces in production
> > at LinkedIn. One is a mixin that preserves external jobs (Spark, Flink,
> > and similar) when the worker gets disrupted instead of cancelling them.
> > The other is a separate infrastructure retry budget set generously
> > enough that infrastructure events don't eat into user-visible retries. I
> > can share anonymized failure-category data from both if it would help
> > ground the default rule library.
> >
> > Looking forward to v2.
> >
> > — Stefan
> >
> >> On Apr 20, 2026, at 1:50 PM, Przemysław Mirowski <[email protected]>
> wrote:
> >>
> >> Great idea! Thanks for proposing it. It will make proper
> >> exception-retry handling much easier than before and will open the
> >> door to more extensibility too.
> >>
> >> +1 also to the questions/concerns Elad mentioned. I'm not sure about
> >> the changes to priority weight (maybe part of AIP-100), but point 2,
> >> about not having full control over the exception raised, is something
> >> we should consider -- looking at the Airflow ecosystem, with all of
> >> the providers and their different libraries.
> >>
> >> One additional comment: since retry policies will only run on workers
> >> (which is pretty nice from, e.g., a security point of view), I didn't
> >> see in the AIP or PR a way to validate that a configured retry policy
> >> will work before the moment it is actually needed. That can make
> >> setting up retry policies harder and testing them cumbersome. A nice
> >> way (from the Dag author's perspective) to test whether a defined
> >> retry policy will actually work when needed would make Dag authors'
> >> lives much easier and defining these rules much simpler (somewhat
> >> related: testing Airflow Connections, and the work on moving "Test
> >> Connection" to workers). Of course, LLM-based retry policies are
> >> rather out of scope here, but testing the more deterministic
> >> behaviours should be much easier to do.
> >>
> >> ________________________________
> >> From: Vincent Beck <[email protected]>
> >> Sent: 20 April 2026 15:17
> >> To: [email protected] <[email protected]>
> >> Subject: Re: [DISCUSS] AIP-105: Pluggable Retry Policies
> >>
> >> Makes a lot of sense to me!
> >>
> >> On 2026/04/19 13:56:56 Elad Kalif wrote:
> >>> Great idea!
> >>> Love it!
> >>>
> >>> I have some questions / comments:
> >>> 1. The current interface suggests rules that contain a RetryRule
> >>> object, but I wonder if we should change exception to exceptions and
> >>> accept a list:
> >>>
> >>>         rules=[
> >>>             RetryRule(
> >>>                 exceptions=["requests.exceptions.HTTPError",
> >>>                             "google.auth.exceptions.RefreshError"],
> >>>                 ...,
> >>>             )
> >>>         ]
> >>>
> >>> I'm thinking about a case where several exceptions need the same
> >>> behaviour and the user may not wish to offer different reasoning for
> >>> each.
> >>>
> >>> 2. Does it make sense to extend the interface to xcom values? I'm
> >>> thinking about a case where dag authors don't have full control over
> >>> the exception raised, or where some upstream library changes the
> >>> exception, which breaks the retry logic. Maybe we should also offer
> >>> the option to set retry behaviour based on the previous attempt's
> >>> xcom value?
> >>>
> >>> 3. Maybe something for the longer run, but still worth discussing:
> >>> one of the main motivations for custom weight rules
> >>>
> https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule
> >>> was to set priority based on try number. I wonder if we may want to
> >>> somehow combine it with the retry rule. For retries, I can argue that
> >>> the weight of the task is a property of the retry instructions, and
> >>> it may well be that the weight will change depending on the exception.
> >>>
> >>> On Sun, Apr 19, 2026 at 6:30 AM Shahar Epstein <[email protected]>
> wrote:
> >>>
> >>>> Great idea! I liked both the deterministic approach as well as the AI
> >>>> integrated.
> >>>>
> >>>>
> >>>> Shahar
> >>>>
> >>>> On Sat, Apr 18, 2026 at 3:02 AM Kaxil Naik <[email protected]>
> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Continuing the push to make Airflow AI-native, I have put together
> >>>>> AIP-105: Pluggable Retry Policies.
> >>>>>
> >>>>> Wiki:
> >>>>>
> >>>>>
> >>>>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> >>>>> PR (core): https://github.com/apache/airflow/pull/65450
> >>>>> PR (LLM-powered, common-ai provider):
> >>>>> https://github.com/apache/airflow/pull/65451
> >>>>>
> >>>>> The problem is straightforward: Airflow retries every failure the
> >>>>> same way. An expired API key gets retried 3 times over 15 minutes.
> >>>>> A rate-limited API gets retried immediately, hitting the same 429.
> >>>>> Users who want smarter retries today have to wrap every task in
> >>>>> try/except and raise AirflowFailException manually, mixing retry
> >>>>> logic into business logic.
> >>>>>
> >>>>> This AIP adds a retry_policy parameter to BaseOperator. The policy
> >>>>> evaluates the actual exception at failure time and returns RETRY
> >>>>> (with a custom delay), FAIL (skip remaining retries), or DEFAULT
> >>>>> (standard behaviour). It runs in the worker process, not the
> >>>>> scheduler.
> >>>>>
> >>>>> Declarative example:
> >>>>>
> >>>>> ```python
> >>>>> @task(
> >>>>>     retries=5,
> >>>>>     retry_policy=ExceptionRetryPolicy(
> >>>>>         rules=[
> >>>>>             RetryRule(
> >>>>>                 exception="requests.exceptions.HTTPError",
> >>>>>                 action=RetryAction.RETRY,
> >>>>>                 retry_delay=timedelta(minutes=5),
> >>>>>             ),
> >>>>>             RetryRule(
> >>>>>                 exception="google.auth.exceptions.RefreshError",
> >>>>>                 action=RetryAction.FAIL,
> >>>>>             ),
> >>>>>         ],
> >>>>>     ),
> >>>>> )
> >>>>> def call_api():
> >>>>>     ...
> >>>>> ```
> >>>>>
> >>>>> LLM-powered example -- uses any pydantic-ai provider (OpenAI,
> >>>>> Anthropic, Bedrock, Ollama):
> >>>>>
> >>>>>     @task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
> >>>>>     def call_flaky_api(): ...
> >>>>>
> >>>>> The LLM version classifies errors into categories (auth,
> >>>>> rate_limit, network, data, transient, permanent) using structured
> >>>>> output, with a 30-second timeout and declarative fallback rules for
> >>>>> when the LLM itself is down.
> >>>>>
> >>>>> I have attached demo videos and screenshots to both PRs showing
> >>>>> both policies running end-to-end in Airflow -- including the LLM
> >>>>> correctly classifying 4 different error types via Claude Haiku.
> >>>>>
> >>>>> Full design, done criteria, and implementation details are in the
> >>>>> wiki page above.
> >>>>>
> >>>>> Feedback welcome.
> >>>>>
> >>>>> Thanks,
> >>>>> Kaxil
> >>>>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
> >
>
>
>
