I definitely understand the use case - resumable checkpoints would be an excellent improvement for long-running jobs.

I also agree with Jarek that there should be some kind of state management in Airflow, and I acknowledge the various state-management proposals; watermarking/CDC would potentially be easier with something like that.

I am not sure I fully understand why the checkpointing proposal couldn't be implemented with deferrable operators and no other change - I might be missing something obvious. Couldn't you raise a TaskDeferred multiple times, once at each checkpoint, with different kwargs and/or a different "next method"? It doesn't *have* to be "execute_complete", right? (There's a rough sketch of what I mean below.)

A second thing that I don't think I grok: wouldn't the checkpoints be more useful if set by the DAG Author rather than the Provider Author? I suppose there are some checkpoints like "provision a cluster", "run a job", and "cleanup" that make sense to the Operator/Provider Author, but mid-run there are likely checkpoints that the DAG Author could identify and resume from. In my opinion, Airflow currently encourages you to use Tasks as the checkpoint and to break work apart on *that* boundary, which likely causes unnecessary network or disk IO if you have to collect results and save them somewhere just to be able to resume/continue the next chunk of work.
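For concreteness, here's a rough, untested sketch of the multi-deferral idea. The operator, the checkpoint names, and the _kick_off helper are made up for illustration, and TimeDeltaTrigger is just a stand-in for whatever job-status trigger you'd actually wait on:

from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class MultiCheckpointOperator(BaseOperator):
    """Illustration only: defer once per checkpoint, resuming at a different method each time."""

    def execute(self, context):
        self._kick_off("provision a cluster")  # placeholder for starting real external work
        # First checkpoint: resume at a method other than execute_complete
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=5)),  # stand-in for a job-status trigger
            method_name="resume_after_provision",
            kwargs={"checkpoint": "provision"},
        )

    def resume_after_provision(self, context, event=None, checkpoint=None):
        self.log.info("Resumed from checkpoint %s", checkpoint)
        self._kick_off("run a job")  # placeholder for starting real external work
        # Second checkpoint: defer again from the resume method, with new kwargs
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=30)),  # stand-in for a job-status trigger
            method_name="resume_after_job",
            kwargs={"checkpoint": "run_job"},
        )

    def resume_after_job(self, context, event=None, checkpoint=None):
        self.log.info("Resumed from checkpoint %s", checkpoint)
        # Cleanup would go here; the return value becomes the task's XCom, same as execute()
        return event

    def _kick_off(self, what):
        # Placeholder: start (or reconnect to) the external work for this stage
        self.log.info("Starting: %s", what)

Each defer() suspends the task again and resumes it at whatever method_name you passed, with the kwargs you passed, so the operator can walk through its checkpoints without ever re-entering execute(). That's why I'm wondering what the proposal needs beyond this.
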
On Fri, Nov 14, 2025 at 6:36 AM Stefan Wang <[email protected]> wrote:

> Hi Airflow Community,
>
> I'm excited to share two complementary proposals that address critical
> reliability challenges in Airflow, particularly around infrastructure
> disruptions and task resilience. These proposals build on insights from
> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task
> executions per cluster).
>
> Proposals
>
> 1. Infrastructure-Aware Task Execution and Context Propagation
>
> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
>
> 2. Resumable Operators for Disruption Readiness
>
> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
>
> What We're Solving
>
> Infrastructure failures consume user retries - Pod evictions shouldn't
> count against application retry budgets
> Wasted computation - Worker crashes shouldn't restart healthy 3-hour
> Databricks jobs from zero
>
> How
>
> Execution Context: Distinguish infrastructure vs application failures for
> smarter retry handling
> Resumable Operators: Checkpoint and reconnect to external jobs after
> disruptions (follows deferral pattern)
>
> These approaches have significantly improved reliability and user
> experience, and reduced wasted costs in our production environment.
>
> Looking forward to your feedback on both the problems we're addressing and
> the proposed solutions. Both proposals are fully backward compatible and
> follow existing Airflow patterns.
>
> Happy to answer any questions or dive deeper into implementation details.
>
> Best,
>
> Stefan Wang
>
>

--
-Fritz Davenport
Principal Data Engineer & Staff Architect, Customer Dept @ Astronomer
