So looks like we have MORE people who would like to join the efforts :D

On Tue, Dec 2, 2025 at 1:35 PM Maciej Obuchowski <[email protected]> wrote:
> Just to add to the pile of use cases:
> that mechanism would also be useful for listeners/OpenLineage integration, to store the necessary lineage data post-execution, to be able to send the OpenLineage events asynchronously, rather than running on the worker and blocking an execution slot.
>
> Thanks,
> Maciej
>
> On Tue, Dec 2, 2025 at 10:45 Jarek Potiuk <[email protected]> wrote:
>
> > One comment here. I looked yesterday again at your proposals, and they are really well thought out.
> > One thing however that I see in it is something of a recurring pattern we have in many discussions:
> >
> > *Storing state in Airflow*
> >
> > This has been discussed in a number of discussions in the past (recent and not-so-recent). I tried to put them together here (in reverse chronological order):
> >
> > * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow Operators` here https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky
> > * Your proposal here - partially - Infrastructure-Aware Task Execution and Resumable Operators
> > * Jake and Guangyang Li - [WIP] AIP-93 Asset Watermarks and State Variables https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b
> > * Daniel's old "State Persistence" AIP -> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence
> >
> > Likely more.
> >
> > I think it's fairly clear that we need State persistence. And there are various ways people wanted to address it:
> >
> > * XD's proposal was to piggyback on XComs and add options to not delete them on resume
> > * Jake and Guangyang - proposed State Variables that would be bound with Assets
> > * Daniel proposed a broader AIP that solves persistence need potentially on various levels (task, dag variable, etc.) - with proposal to use separate ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6).
> > Also probably now that would extend to AssetState if it is followed.
> >
> > Maybe it's a good time to join the efforts and propose a single solution that can help to address all those "state persistence" needs?
> >
> > I think we have now enough concrete use cases - from the above proposals and probably more - to make a single proposal that will be usable to address all of the needs. We have a number of smart people who, if they discuss and work together on a single solution, might likely come to a good proposal **just** on state persistence that will be usable for all those cases?
> >
> > If you were to break your proposals, Stefan, into smaller pieces and incremental deliverables, I would say getting this one done is not only moving your ideas forward, but it also moves many other ideas forward that could be implemented in parallel as a next step after this "foundational" state persistence is added with some very simple use case to start with. That would make it a perfect approach: band together to make a foundational feature, so that then you can split off and work on all those other ideas in parallel.
> >
> > We just need someone to volunteer and lead the efforts - and others here to join and do the work together.
> >
> > J.
> >
> >
> > On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote:
> >
> > > Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1
> > >
> > > Thank you Jarek for the thoughtful guidance. I really appreciate you taking the time to guide me through this. Totally agree with your advice about starting small and building things incrementally, and I'll keep it in mind throughout this effort.
> > >
> > > The proposals aim to address shared reliability challenges that we have been seeing across medium to large scale Airflow deployments in the community (ref: OpenAI 2025 Airflow Summit Talk <https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability Section), LinkedIn (here in this thread), and Apple with Xiaodong's thread/AIP <https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky> (specifically External Job Tracking and Polling) - I’ll follow up in there as well to collaborate):
> > >
> > > Better Context propagation and Infra Retry budget: Help distinguish infrastructure failures (pod evictions, worker crashes) from application errors for smarter cleanup decisions and protected user retry budgets - we already have access to the SOT context - just need to propagate it better in the existing ecosystem (through passing additional optional msg or exception handling, or something else)
> > >
> > > Resumable Operators (in parallel with Deferrable Operators): Let operators reconnect to healthy external jobs (Databricks, EMR) after worker disruptions instead of wastefully restarting
> > >
> > > Both are designed to be completely backward compatible, opt-in only, and designed with specific leverage on existing well-established Airflow features, hooks, and patterns (deferral mechanism, execution context).
> > >
> > > Rather than pushing for big changes upfront in one go, throughout this effort, things will be broken into small, incremental pieces that each provide standalone value. Start with the tiniest possible change (e.g., optional execution_context parameter, purely additive). Continue contributing in other areas, especially reliability-related, to maintain consistency and trust.
> > > Keep the broader vision in the design proposal, but let the implementation evolve based on community feedback.
> > >
> > > I want to make sure this is done in a way that's most beneficial to the community. Guidance and support from you and others in the community overall will help us a lot in approaching this the right way. Thank you!
> > >
> > > Best,
> > > Stefan
> > >
> > >
> > > > On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
> > > >
> > > > Hi Jens,
> > > >
> > > > Thank you so much for the help and for being so supportive. It’s working for me now!
> > > >
> > > > Really appreciate you stepping in.
> > > >
> > > > Best,
> > > > Stefan
> > > >
> > > >
> > > >> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
> > > >>
> > > >> As PMC we are space owners, added your permissions for the user stefwang to the Airflow space. Hope now it is working.
> > > >>
> > > >> On 11/30/25 04:54, Stefan Wang wrote:
> > > >>> Apologies for the late response folks while I had oncall shifts. Catching up here and will respond to each comment in order.
> > > >>>
> > > >>> ---
> > > >>>
> > > >>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from Jens Scheffler
> > > >>>
> > > >>> Hi Jens,
> > > >>>
> > > >>> Thanks for the suggestion! I completely agree that following the formal AIP process is the right approach.
> > > >>>
> > > >>> I've been trying to create the AIPs on the Confluence wiki, but I'm running into permission issues. When I click the "Create new AIP" button on the AIP page, I get a "Sorry, you don't have permission to create content" error.
> > > >>>
> > > >>> I've tried following the exact steps listed to create an ASF Confluence account, but neither account has EDIT access under the AIRFLOW space. I created two accounts (stewang and stefwang) to rule out any account-specific issues, but both have the same problem. I would really appreciate some help pointing me to who we should contact to get the appropriate permissions. Or is there a specific access request process I should follow? Alternatively, if someone else with edit access could help copy-paste the Google Doc content into Confluence for comments, thanks a lot!
> > > >>>
> > > >>> I’ll try to contact ASF infra support in the meantime, and will work on migrating the Google Docs to Confluence once I have access.
> > > >>>
> > > >>> Thanks, Stefan
> > > >>>
> > > >>>
> > > >>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
> > > >>>>
> > > >>>> Hi Stefan,
> > > >>>>
> > > >>>> Thank you for the work! Very well organised and easy to follow docs.
> > > >>>>
> > > >>>> I have been thinking about infrastructure retries for a while now. Also, I had a few discussions at the Airflow Summit last month and I know that others are interested as well.
> > > >>>>
> > > >>>> It looks to me too, that this will be split into multiple PRs but if there is a code POC, I would like to take a look.
> > > >>>>
> > > >>>> Regards,
> > > >>>> Christos
> > > >>>>
> > > >>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
> > > >>>>
> > > >>>>> Also something we discussed off-line: I think the scope of it is quite "huge" - but there are small and incremental improvements that might not even require an AIP and can be implemented as PRs. I think it's great to keep the "big hairy vision" in your head (like I did several years ago when I proposed a "small" improvement in our dependency management that took about 4 years to get to the stage I thought it would take a few weeks).
> > > >>>>>
> > > >>>>> Getting incremental improvements and showing the dedication, merit and consistent pattern of improvements is a key to get - eventually - big and "world-changing" changes.
> > > >>>>>
> > > >>>>> J.
> > > >>>>>
> > > >>>>>
> > > >>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]> wrote:
> > > >>>>>
> > > >>>>>> Hi Stefan,
> > > >>>>>>
> > > >>>>>> thanks for dropping the proposals!
> > > >>>>>>
> > > >>>>>> I'd propose to store the documents in cWiki and open them formally in there as an AIP proposal, as then it is following the AIP process.
> > > >>>>>>
> > > >>>>>> See
> > > >>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > > >>>>>>
> > > >>>>>> Jens
> > > >>>>>>
> > > >>>>>> On 11/14/25 12:35, Stefan Wang wrote:
> > > >>>>>>> Hi Airflow Community,
> > > >>>>>>>
> > > >>>>>>> I'm excited to share two complementary proposals that address critical reliability challenges in Airflow, particularly around infrastructure disruptions and task resilience.
> > > >>>>>>> These proposals build on insights from managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task executions per cluster).
> > > >>>>>>>
> > > >>>>>>> Proposals
> > > >>>>>>>
> > > >>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
> > > >>>>>>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> > > >>>>>>>
> > > >>>>>>> 2. Resumable Operators for Disruption Readiness
> > > >>>>>>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> > > >>>>>>>
> > > >>>>>>> What We're Solving
> > > >>>>>>>
> > > >>>>>>> Infrastructure failures consume user retries - Pod evictions shouldn't count against application retry budgets
> > > >>>>>>> Wasted computation - Worker crashes shouldn't restart healthy 3-hour Databricks jobs from zero
> > > >>>>>>>
> > > >>>>>>> How
> > > >>>>>>>
> > > >>>>>>> Execution Context: Distinguish infrastructure vs application failures for smarter retry handling
> > > >>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs after disruptions (follows deferral pattern)
> > > >>>>>>>
> > > >>>>>>> These approaches have significantly improved reliability and user experience, and reduced wasted costs in our production environment.
> > > >>>>>>>
> > > >>>>>>> Looking forward to your feedback on both the problems we're addressing and the proposed solutions. Both proposals are fully backward compatible and follow existing Airflow patterns.
> > > >>>>>>>
> > > >>>>>>> Happy to answer any questions or dive deeper into implementation details.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>>
> > > >>>>>>> Stefan Wang
> > > >>>>>>>
> > > >>>>>> ---------------------------------------------------------------------
> > > >>>>>> To unsubscribe, e-mail: [email protected]
> > > >>>>>> For additional commands, e-mail: [email protected]
