Re: https://lists.apache.org/thread/z5swo1vrtwxpgzwbrnf5h7xyqwtlrprx
Hi Christos, Thank you so much for the feedback and interest! Really appreciate you taking the time to review the proposals. You mentioned you'd like to see a code POC for the Context Propagation and Infra Retry Budget (Infrastructure-Aware Task Execution) <https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M/edit?usp=sharing> proposal—I believe that's a great way to make these concepts more concrete. If it helps the community get a better sense of what this looks like, I'll try to put up a draft PR in the coming weeks to showcase the idea better. Thanks again! Best, Stefan > On Dec 4, 2025, at 2:47 AM, Stefan Wang <[email protected]> wrote: > > Hi Jarek, Maciej, and everyone, > > Thank you all for the incredibly thoughtful analysis! After diving deep into > the source code and carefully comparing the different proposals, I want to > understand a bit more about what problems we are solving, and share my > learnings on this topic. > (btw I also wanted to give a quick acknowledge that the ideas in my proposals > incorporated valuable insights from offline conversations with Ping Zhang and > Howie Wang from OpenAI's Airflow team) > > Problem Space: > After reviewing the current airflow src code and proposals in detail, I > realize these aren't exactly solving the same problem. Let me break down what > I see: > > Category 1: > Scoped Intra-Task-Instance State (Retries within the SAME task instance) > XD's AIP draft > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>: > Preserve external job (ID) across retry to avoid duplicate submissions (via > XCOM) > Resumable Operators > <https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI>: > Preserve external job (ID) during infra disruptions (with context-aware > cleanup) via TaskInstance’s next_method and next_method_kwargs (and introduce > a RESUMING state to leverage the same “re-launch” logic from Deferrable - > this is an existing persistence mechanism) > > Category 2: > Cross-Run State (DIFFERENT DAG runs) > Asset Watermarks: Triggers need "last processed" state across multiple DAG > runs (e.g., "last S3 file timestamp") > This to me looks different—it's not about retries within a task instance, but > incremental processing across separate executions. (pls correct me if this is > wrong) > > Category 3: > Post-Execution Async State > OpenLineage use case: Store lineage data for async emission after task > completion (pls correct me if this is wrong) > > So rather than claiming to drive "all state persistence”, would it be better > for me to volunteer to drive specifically - Intra-Task-Instance State > Persistence (Retry Scope) story, potentially jointly with XD > And looks like Jake and Guangyang are already driving the cross-run state > (asset watermarks) story, which I agree is the right separation of concerns. > > Very open to feedback here and would love to hear thoughts on this. Thanks a > lot folks! > > Best, > Stefan > >> On Dec 2, 2025, at 4:40 AM, Jarek Potiuk <[email protected]> wrote: >> >> So looks like we have MORE people who would like to join the efforts :D >> >> On Tue, Dec 2, 2025 at 1:35 PM Maciej Obuchowski <[email protected]> >> wrote: >> >>> Just to add to the pile of use cases: >>> that mechanism would also be useful for listeners/OpenLineage integration, >>> to store the necessary lineage data post-execution, to be able to send the >>> OpenLineage events asynchronously, rather than running on >>> worker and blocking execution slot. >>> >>> Thanks, >>> Maciej >>> >>> wt., 2 gru 2025 o 10:45 Jarek Potiuk <[email protected]> napisał(a): >>> >>>> One comment here. I looked yesterday again at your proposals, and they >>> are >>>> really well thought out. >>>> One thing however that I see in it is something of a recurring pattern we >>>> have in many discussions: >>>> >>>> *Storing state in Airflow* >>>> >>>> This has been discussed in a number of discussions in the past (recent >>> and >>>> not-so-recent). I tried to put them together here (in reverse >>> chronological >>>> order): >>>> >>>> * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow >>>> Operators` here >>>> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky >>>> * Your proposal here - partially - Infrastructure-Aware Task Execution >>> and >>>> Resumable Operators >>>> * Jake and Guangyang Li - [WIP] AIP-93 Asset Watermarks and State >>>> Variables >>>> https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b >>>> * Daniels old "State Persistence" AIP -> >>>> >>>> >>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence >>>> >>>> Likely more. >>>> >>>> I think it's fairly clear that we need State persistence. And there are >>>> various way people wanted to address it: >>>> >>>> * XD's proposal was to piggyback on Xcoms and add options to not delete >>>> them on resume >>>> * Jake and Guangyang - proposed State Variables that would be bound with >>>> Assets >>>> * Daniel proposed a broader AIP that solves persistence need potentially >>> on >>>> various levels (task, dag variable, etc. ) - with proposal to use >>> separate >>>> ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6). >>> Also >>>> probably now that would extend to AssetState if it is followed >>>> >>>> Maybe it's a good time to join the efforts and propose a single solution >>>> that can help to address all those "state persistence" needs ? >>>> >>>> I think we have now enough concrete use cases - from the above proposals >>>> and probably more, to make a single proposal that will be usable to >>> address >>>> all of the needs. We have a number of smart people who - if they discuss >>>> and work together on a single solution, might likely come to a good >>>> proposal **just** on state persistence that will be usable for all those >>>> cases ? >>>> >>>> If you were to break your proposals Stefan into smaller pieces and >>>> incremental deliverables, I would say - getting this one done is not only >>>> moving your ideas forward, but also it moves many other ideas forward >>> that >>>> could be implemented in parallel as next step after this "foundational" >>>> state persistence is added with some very simple use case to start with. >>>> That would make it perfect approach - band together to make a >>> foundational >>>> feature, so that then you can split off and work on all those other ideas >>>> in parallel. >>>> >>>> We just need someone to volunteer and lead the efforts - and others here >>> to >>>> join and do the work together. >>>> >>>> J. >>>> >>>> >>>> On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote: >>>> >>>>> Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1 >>>>> >>>>> Thank you Jarek for the thoughtful guidance — I really appreciate you >>>>> taking the time to guide me through this. Totally agree with your >>> advice >>>>> about starting small and building things incrementally, and I'll keep >>> it >>>> in >>>>> mind throughout this effort. >>>>> >>>>> The proposals aims to address shared reliability challenges that have >>>> been >>>>> seeing across medium to large scale Airflow deployments in the >>> community >>>>> (ref: OpenAI 2025 Airflow Summit Talk < >>>>> https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability >>>>> Section), LinkedIn (here in this thread), and Apple with Xiaodong's >>>>> thread/AIP < >>>>> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky> >>>>> (specifically External Job Tracking and Polling) - I’ll follow up in >>>> there >>>>> as well to collaborate): >>>>> >>>>> Better Context propagation and Infra Retry budget: Help distinguish >>>>> infrastructure failures (pod evictions, worker crashes) from >>> application >>>>> errors for smarter cleanup decisions and protected user retry budgets - >>>> we >>>>> already have access to the SOT context - just need to propagate it >>> better >>>>> in the existing ecosystem (through passing additional optional msg or >>>>> exception handling, or something else) >>>>> >>>>> Resumable Operators (in parallel with Deferrable Operators): Let >>>> operators >>>>> reconnect to healthy external jobs (Databricks, EMR) after worker >>>>> disruptions instead of wastefully restarting >>>>> >>>>> Both are designed to be completely backward compatible, opt-in only, >>> and >>>>> designed with specific leverage on existing well-established Airflow >>>>> features, hooks, and patterns (deferral mechanism, execution context). >>>>> >>>>> Rather than pushing for big changes upfront in one go, throughout this >>>>> effort, things will be broken into small, incremental pieces that each >>>>> provide standalone value. Start with the tiniest possible change (e.g., >>>>> optional execution_context parameter — purely additive). Continue >>>>> contributing in other areas especially reliability-related, to maintain >>>>> consistency and trust. Keep the broader vision in the design proposal, >>>> but >>>>> let the implementation evolve based on community feedback. >>>>> >>>>> I want to make sure this is done in a way that's most beneficial to the >>>>> community. Guidance and support from you and others in the community >>>>> overall will help us a lot in approaching this the right way. Thank >>> you! >>>>> >>>>> Best, >>>>> Stefan >>>>> >>>>> >>>>>> On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote: >>>>>> >>>>>> Hi Jens, >>>>>> >>>>>> Thank you so much for the help and for being so supportive — it’s >>>>> working for me now! >>>>>> >>>>>> Really appreciate you stepping in. >>>>>> >>>>>> Best, >>>>>> Stefan >>>>>> >>>>>> >>>>>>> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> >>>>> wrote: >>>>>>> >>>>>>> As PMC we are space owners, added your permissions for the user >>>>> stefwang to the Airflow space. Hope now it is working. >>>>>>> >>>>>>> On 11/30/25 04:54, Stefan Wang wrote: >>>>>>>> Apologies for the late response folks while I had oncall shifts. >>>>> Catching up here and will respond to each comment in order. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> — >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Re: >>> https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 >>>>> from Jens Scheffler >>>>>>>> >>>>>>>> Hi Jens, >>>>>>>> >>>>>>>> Thanks for the suggestion! I completely agree that following the >>>>> formal AIP process is the right approach. >>>>>>>> >>>>>>>> I've been trying to create the AIPs on the Confluence wiki, but I'm >>>>> running into permission issues. When I click the "Create new AIP" >>> button >>>> on >>>>> the AIP page, I get a "Sorry, you don't have permission to create >>>> content" >>>>> error. >>>>>>>> >>>>>>>> I've tried following the exact step listed to create ASF confluence >>>>> account however neither has EDIT access granted under the AIRFLOW >>> space, >>>>> created two accounts (stewang and stefwang) to rule out any >>>>> account-specific issues, but both accounts have the same problem. Would >>>>> really appreciate some expertise in this area to help point me to who >>> we >>>>> should contact to get the appropriate permissions, or is there a >>> specific >>>>> access request process I should follow? - Or if someone else with edit >>>>> access could help copy paste the google doc content into Confluence for >>>>> comments, thanks a lot! >>>>>>>> >>>>>>>> I’ll try to contact ASF infra support in the mean time, and will >>> work >>>>> on migrate the Google Docs to Confluence once I have access. >>>>>>>> >>>>>>>> Thanks, Stefan >>>>>>>> >>>>>>>> >>>>>>>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias < >>> [email protected] >>>>> >>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Stefan, >>>>>>>>> >>>>>>>>> Thank you for the work! Very well organised and easy to follow >>> docs. >>>>>>>>> >>>>>>>>> I have been thinking about infrastructure retries for a while now. >>>>> Also, I >>>>>>>>> had a few discussions at the Airflow Summit last month and I know >>>> that >>>>>>>>> others are interested as well. >>>>>>>>> >>>>>>>>> It looks to me too, that this will be split into multiple PRs but >>> if >>>>> there >>>>>>>>> is a code POC, I would like to take a look. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christos >>>>>>>>> >>>>>>>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> >>>>> wrote: >>>>>>>>> >>>>>>>>>> Also something we discussed off-line: I think the scope of it is >>>>> quite >>>>>>>>>> "huge" - but there are small and incremental improvements, that >>>>> might not >>>>>>>>>> even require AIP that can be implemented as PRs., I think it's >>>> great >>>>> to >>>>>>>>>> keep "big hairy vision" in head (like I did several years ago >>> when >>>> I >>>>>>>>>> proposed a "small" improvement in our dependency management that >>>>> took about >>>>>>>>>> 4 years to get to the stage I thought it would take a few weeks. >>>>>>>>>> >>>>>>>>>> Getting incremental improvements and showing the dedication, >>> merit >>>>> and >>>>>>>>>> consistent pattern of improvements is a key to get - eventually - >>>>> big and >>>>>>>>>> "world-changing" changes. >>>>>>>>>> >>>>>>>>>> J. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler < >>>> [email protected] >>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Stefan, >>>>>>>>>>> >>>>>>>>>>> thanks for dropping the proposals! >>>>>>>>>>> >>>>>>>>>>> I'd propose to store the documents in cWiki and open them >>> formally >>>>> in >>>>>>>>>>> there as AIP proposal as then it is sollowing the AIP process. >>>>>>>>>>> >>>>>>>>>>> See >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>> >>>> >>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals >>>>>>>>>>> Jens >>>>>>>>>>> >>>>>>>>>>> On 11/14/25 12:35, Stefan Wang wrote: >>>>>>>>>>>> Hi Airflow Community, >>>>>>>>>>>> >>>>>>>>>>>> I'm excited to share two complementary proposals that address >>>>> critical >>>>>>>>>>> reliability challenges in Airflow, particularly around >>>>> infrastructure >>>>>>>>>>> disruptions and task resilience. These proposals build on >>> insights >>>>> from >>>>>>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ >>>>> daily >>>>>>>>>> task >>>>>>>>>>> executions per cluster). >>>>>>>>>>>> Proposals >>>>>>>>>>>> >>>>>>>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>> >>>> >>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M >>>>>>>>>>>> 2. Resumable Operators for Disruption Readiness >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>> >>>> >>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI >>>>>>>>>>>> What We're Solving >>>>>>>>>>>> >>>>>>>>>>>> Infrastructure failures consume user retries - Pod evictions >>>>> shouldn't >>>>>>>>>>> count against application retry budgets >>>>>>>>>>>> Wasted computation - Worker crashes shouldn't restart healthy >>>>> 3-hour >>>>>>>>>>> Databricks jobs from zero >>>>>>>>>>>> How >>>>>>>>>>>> >>>>>>>>>>>> Execution Context: Distinguish infrastructure vs application >>>>> failures >>>>>>>>>>> for smarter retry handling >>>>>>>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs >>>>> after >>>>>>>>>>> disruptions (follows deferral pattern) >>>>>>>>>>>> These approaches have significantly improved reliability and >>> user >>>>>>>>>>> experience, and reduced wasted costs in our production >>>> environment. >>>>>>>>>>>> Looking forward to your feedback on both the problems we're >>>>> addressing >>>>>>>>>>> and the proposed solutions. Both proposals are fully backward >>>>> compatible >>>>>>>>>>> and follow existing Airflow patterns. >>>>>>>>>>>> Happy to answer any questions or dive deeper into >>> implementation >>>>>>>>>> details. >>>>>>>>>>>> Best, >>>>>>>>>>>> >>>>>>>>>>>> Stefan Wang >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>> --------------------------------------------------------------------- >>>>>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>>>>> >>>>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>> For additional commands, e-mail: [email protected] >>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >
