Re: [DISCUSS] Infrastructure-Aware Task Execution and Resumable Operators - Proposals for Reliability

Stefan Wang Thu, 04 Dec 2025 03:22:52 -0800

Re: https://lists.apache.org/thread/z5swo1vrtwxpgzwbrnf5h7xyqwtlrprx


Hi Christos,

Thank you so much for the feedback and interest! Really appreciate you taking 
the time to review the proposals.

You mentioned you'd like to see a code POC for the Context Propagation and 
Infra Retry Budget (Infrastructure-Aware Task Execution) 
<https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M/edit?usp=sharing>
 proposal—I believe that's a great way to make these concepts more concrete.

If it helps the community get a better sense of what this looks like, I'll try 
to put up a draft PR in the coming weeks to showcase the idea better.

Thanks again!

Best,
Stefan

> On Dec 4, 2025, at 2:47 AM, Stefan Wang <[email protected]> wrote:
> 
> Hi Jarek, Maciej, and everyone,
> 
> Thank you all for the incredibly thoughtful analysis! After diving deep into 
> the source code and carefully comparing the different proposals, I want to 
> understand a bit more about what problems we are solving, and share my 
> learnings on this topic.
> (btw I also wanted to give a quick acknowledge that the ideas in my proposals 
> incorporated valuable insights from offline conversations with Ping Zhang and 
> Howie Wang from OpenAI's Airflow team)
> 
> Problem Space:
> After reviewing the current airflow src code and proposals in detail, I 
> realize these aren't exactly solving the same problem. Let me break down what 
> I see:
> 
> Category 1:
> Scoped Intra-Task-Instance State (Retries within the SAME task instance)
> XD's AIP draft 
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>: 
> Preserve external job (ID) across retry to avoid duplicate submissions (via 
> XCOM)
> Resumable Operators 
> <https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI>:
>  Preserve external job (ID) during infra disruptions (with context-aware 
> cleanup) via TaskInstance’s next_method and next_method_kwargs (and introduce 
> a RESUMING state to leverage the same “re-launch” logic from Deferrable - 
> this is an existing persistence mechanism)
> 
> Category 2:
> Cross-Run State (DIFFERENT DAG runs)
> Asset Watermarks: Triggers need "last processed" state across multiple DAG 
> runs (e.g., "last S3 file timestamp")
> This to me looks different—it's not about retries within a task instance, but 
> incremental processing across separate executions. (pls correct me if this is 
> wrong)
> 
> Category 3:
> Post-Execution Async State
> OpenLineage use case: Store lineage data for async emission after task 
> completion (pls correct me if this is wrong)
> 
> So rather than claiming to drive "all state persistence”, would it be better 
> for me to volunteer to drive specifically - Intra-Task-Instance State 
> Persistence (Retry Scope) story, potentially jointly with XD
> And looks like Jake and Guangyang are already driving the cross-run state 
> (asset watermarks) story, which I agree is the right separation of concerns.
> 
> Very open to feedback here and would love to hear thoughts on this. Thanks a 
> lot folks!
> 
> Best,
> Stefan
> 
>> On Dec 2, 2025, at 4:40 AM, Jarek Potiuk <[email protected]> wrote:
>> 
>> So looks like we have MORE people who would like to join the efforts :D
>> 
>> On Tue, Dec 2, 2025 at 1:35 PM Maciej Obuchowski <[email protected]>
>> wrote:
>> 
>>> Just to add to the pile of use cases:
>>> that mechanism would also be useful for listeners/OpenLineage integration,
>>> to store the necessary lineage data post-execution, to be able to send the
>>> OpenLineage events asynchronously, rather than running on
>>> worker and blocking execution slot.
>>> 
>>> Thanks,
>>> Maciej
>>> 
>>> wt., 2 gru 2025 o 10:45 Jarek Potiuk <[email protected]> napisał(a):
>>> 
>>>> One comment here. I looked yesterday again at your proposals, and they
>>> are
>>>> really well thought out.
>>>> One thing however that I see in it is something of a recurring pattern we
>>>> have in many discussions:
>>>> 
>>>> *Storing state in Airflow*
>>>> 
>>>> This has been discussed in a number of discussions in the past (recent
>>> and
>>>> not-so-recent). I tried to put them together here (in reverse
>>> chronological
>>>> order):
>>>> 
>>>> * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow
>>>> Operators` here
>>>> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky
>>>> * Your proposal here - partially -  Infrastructure-Aware Task Execution
>>> and
>>>> Resumable Operators
>>>> * Jake and Guangyang Li  - [WIP] AIP-93 Asset Watermarks and State
>>>> Variables
>>>> https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b
>>>> * Daniels old "State Persistence" AIP ->
>>>> 
>>>> 
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence
>>>> 
>>>> Likely more.
>>>> 
>>>> I think it's fairly clear that we need State persistence. And there are
>>>> various way people wanted to address it:
>>>> 
>>>> * XD's proposal was to piggyback on Xcoms and add options to not delete
>>>> them on resume
>>>> * Jake and Guangyang - proposed State Variables that would be bound with
>>>> Assets
>>>> * Daniel proposed a broader AIP that solves persistence need potentially
>>> on
>>>> various levels (task, dag variable, etc. ) - with proposal to use
>>> separate
>>>> ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6).
>>> Also
>>>> probably now that would extend to AssetState if it is followed
>>>> 
>>>> Maybe it's a good time to join the efforts and propose a single solution
>>>> that can help to address all those "state persistence" needs ?
>>>> 
>>>> I think we have now enough concrete use cases - from the above proposals
>>>> and probably more, to make a single proposal that will be usable to
>>> address
>>>> all of the needs. We have a number of smart people who - if they discuss
>>>> and work together on a single solution, might likely come to a good
>>>> proposal **just** on state persistence that will be usable for all those
>>>> cases ?
>>>> 
>>>> If you were to break your proposals Stefan into smaller pieces and
>>>> incremental deliverables, I would say - getting this one done is not only
>>>> moving your ideas forward, but also it moves many other ideas forward
>>> that
>>>> could be implemented in parallel as next step after this "foundational"
>>>> state persistence is added with some very simple use case to start with.
>>>> That would make it perfect approach - band together to make a
>>> foundational
>>>> feature, so that then you can split off and work on all those other ideas
>>>> in parallel.
>>>> 
>>>> We just need someone to volunteer and lead the efforts - and others here
>>> to
>>>> join and do the work together.
>>>> 
>>>> J.
>>>> 
>>>> 
>>>> On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote:
>>>> 
>>>>> Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1
>>>>> 
>>>>> Thank you Jarek for the thoughtful guidance — I really appreciate you
>>>>> taking the time to guide me through this. Totally agree with your
>>> advice
>>>>> about starting small and building things incrementally, and I'll keep
>>> it
>>>> in
>>>>> mind throughout this effort.
>>>>> 
>>>>> The proposals aims to address shared reliability challenges that have
>>>> been
>>>>> seeing across medium to large scale Airflow deployments in the
>>> community
>>>>> (ref: OpenAI 2025 Airflow Summit Talk <
>>>>> https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability
>>>>> Section), LinkedIn (here in this thread), and Apple with Xiaodong's
>>>>> thread/AIP <
>>>>> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky>
>>>>> (specifically External Job Tracking and Polling) - I’ll follow up in
>>>> there
>>>>> as well to collaborate):
>>>>> 
>>>>> Better Context propagation and Infra Retry budget: Help distinguish
>>>>> infrastructure failures (pod evictions, worker crashes) from
>>> application
>>>>> errors for smarter cleanup decisions and protected user retry budgets -
>>>> we
>>>>> already have access to the SOT context - just need to propagate it
>>> better
>>>>> in the existing ecosystem (through passing additional optional msg or
>>>>> exception handling, or something else)
>>>>> 
>>>>> Resumable Operators (in parallel with Deferrable Operators): Let
>>>> operators
>>>>> reconnect to healthy external jobs (Databricks, EMR) after worker
>>>>> disruptions instead of wastefully restarting
>>>>> 
>>>>> Both are designed to be completely backward compatible, opt-in only,
>>> and
>>>>> designed with specific leverage on existing well-established Airflow
>>>>> features, hooks, and patterns (deferral mechanism, execution context).
>>>>> 
>>>>> Rather than pushing for big changes upfront in one go, throughout this
>>>>> effort, things will be broken into small, incremental pieces that each
>>>>> provide standalone value. Start with the tiniest possible change (e.g.,
>>>>> optional execution_context parameter — purely additive). Continue
>>>>> contributing in other areas especially reliability-related, to maintain
>>>>> consistency and trust. Keep the broader vision in the design proposal,
>>>> but
>>>>> let the implementation evolve based on community feedback.
>>>>> 
>>>>> I want to make sure this is done in a way that's most beneficial to the
>>>>> community. Guidance and support from you and others in the community
>>>>> overall will help us a lot in approaching this the right way. Thank
>>> you!
>>>>> 
>>>>> Best,
>>>>> Stefan
>>>>> 
>>>>> 
>>>>>> On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
>>>>>> 
>>>>>> Hi Jens,
>>>>>> 
>>>>>> Thank you so much for the help and for being so supportive — it’s
>>>>> working for me now!
>>>>>> 
>>>>>> Really appreciate you stepping in.
>>>>>> 
>>>>>> Best,
>>>>>> Stefan
>>>>>> 
>>>>>> 
>>>>>>> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]>
>>>>> wrote:
>>>>>>> 
>>>>>>> As PMC we are space owners, added your permissions for the user
>>>>> stefwang to the Airflow space. Hope now it is working.
>>>>>>> 
>>>>>>> On 11/30/25 04:54, Stefan Wang wrote:
>>>>>>>> Apologies for the late response folks while I had oncall shifts.
>>>>> Catching up here and will respond to each comment in order.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> —
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Re:
>>> https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4
>>>>> from Jens Scheffler
>>>>>>>> 
>>>>>>>> Hi Jens,
>>>>>>>> 
>>>>>>>> Thanks for the suggestion! I completely agree that following the
>>>>> formal AIP process is the right approach.
>>>>>>>> 
>>>>>>>> I've been trying to create the AIPs on the Confluence wiki, but I'm
>>>>> running into permission issues. When I click the "Create new AIP"
>>> button
>>>> on
>>>>> the AIP page, I get a "Sorry, you don't have permission to create
>>>> content"
>>>>> error.
>>>>>>>> 
>>>>>>>> I've tried following the exact step listed to create ASF confluence
>>>>> account however neither has EDIT access granted under the AIRFLOW
>>> space,
>>>>> created two accounts (stewang and stefwang) to rule out any
>>>>> account-specific issues, but both accounts have the same problem. Would
>>>>> really appreciate some expertise in this area to help point me to who
>>> we
>>>>> should contact to get the appropriate permissions, or is there a
>>> specific
>>>>> access request process I should follow? - Or if someone else with edit
>>>>> access could help copy paste the google doc content into Confluence for
>>>>> comments, thanks a lot!
>>>>>>>> 
>>>>>>>> I’ll try to contact ASF infra support in the mean time, and will
>>> work
>>>>> on migrate the Google Docs to Confluence once I have access.
>>>>>>>> 
>>>>>>>> Thanks, Stefan
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <
>>> [email protected]
>>>>> 
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Stefan,
>>>>>>>>> 
>>>>>>>>> Thank you for the work! Very well organised and easy to follow
>>> docs.
>>>>>>>>> 
>>>>>>>>> I have been thinking about infrastructure retries for a while now.
>>>>> Also, I
>>>>>>>>> had a few discussions at the Airflow Summit last month and I know
>>>> that
>>>>>>>>> others are interested as well.
>>>>>>>>> 
>>>>>>>>> It looks to me too, that this will be split into multiple PRs but
>>> if
>>>>> there
>>>>>>>>> is a code POC, I would like to take a look.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Christos
>>>>>>>>> 
>>>>>>>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Also something we discussed off-line: I think the scope of it is
>>>>> quite
>>>>>>>>>> "huge" - but there are small and incremental improvements, that
>>>>> might not
>>>>>>>>>> even require AIP that can be implemented as PRs., I think it's
>>>> great
>>>>> to
>>>>>>>>>> keep "big hairy vision" in head (like I did several years ago
>>> when
>>>> I
>>>>>>>>>> proposed a "small" improvement in our dependency management that
>>>>> took about
>>>>>>>>>> 4 years to get to the stage I thought it would take a few weeks.
>>>>>>>>>> 
>>>>>>>>>> Getting incremental improvements and showing the dedication,
>>> merit
>>>>> and
>>>>>>>>>> consistent pattern of improvements is a key to get - eventually -
>>>>> big and
>>>>>>>>>> "world-changing" changes.
>>>>>>>>>> 
>>>>>>>>>> J.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <
>>>> [email protected]
>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Stefan,
>>>>>>>>>>> 
>>>>>>>>>>> thanks for dropping the proposals!
>>>>>>>>>>> 
>>>>>>>>>>> I'd propose to store the documents in cWiki and open them
>>> formally
>>>>> in
>>>>>>>>>>> there as AIP proposal as then it is sollowing the AIP process.
>>>>>>>>>>> 
>>>>>>>>>>> See
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>>> 
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>>>>>>>>>> Jens
>>>>>>>>>>> 
>>>>>>>>>>> On 11/14/25 12:35, Stefan Wang wrote:
>>>>>>>>>>>> Hi Airflow Community,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm excited to share two complementary proposals that address
>>>>> critical
>>>>>>>>>>> reliability challenges in Airflow, particularly around
>>>>> infrastructure
>>>>>>>>>>> disruptions and task resilience. These proposals build on
>>> insights
>>>>> from
>>>>>>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+
>>>>> daily
>>>>>>>>>> task
>>>>>>>>>>> executions per cluster).
>>>>>>>>>>>> Proposals
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>>> 
>>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
>>>>>>>>>>>> 2. Resumable Operators for Disruption Readiness
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>> 
>>>> 
>>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
>>>>>>>>>>>> What We're Solving
>>>>>>>>>>>> 
>>>>>>>>>>>> Infrastructure failures consume user retries - Pod evictions
>>>>> shouldn't
>>>>>>>>>>> count against application retry budgets
>>>>>>>>>>>> Wasted computation - Worker crashes shouldn't restart healthy
>>>>> 3-hour
>>>>>>>>>>> Databricks jobs from zero
>>>>>>>>>>>> How
>>>>>>>>>>>> 
>>>>>>>>>>>> Execution Context: Distinguish infrastructure vs application
>>>>> failures
>>>>>>>>>>> for smarter retry handling
>>>>>>>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs
>>>>> after
>>>>>>>>>>> disruptions (follows deferral pattern)
>>>>>>>>>>>> These approaches have significantly improved reliability and
>>> user
>>>>>>>>>>> experience, and reduced wasted costs in our production
>>>> environment.
>>>>>>>>>>>> Looking forward to your feedback on both the problems we're
>>>>> addressing
>>>>>>>>>>> and the proposed solutions. Both proposals are fully backward
>>>>> compatible
>>>>>>>>>>> and follow existing Airflow patterns.
>>>>>>>>>>>> Happy to answer any questions or dive deeper into
>>> implementation
>>>>>>>>>> details.
>>>>>>>>>>>> Best,
>>>>>>>>>>>> 
>>>>>>>>>>>> Stefan Wang
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>

Re: [DISCUSS] Infrastructure-Aware Task Execution and Resumable Operators - Proposals for Reliability

Reply via email to