Thanks Vikram,

This is a crucial AIP for Airflow 3.3+. I skimmed through it and will provide more comments over the coming days, but it looks very much like what I imagined for state management in Airflow. It has about the right abstraction layer, focusing on building infrastructure that serves the previously articulated use cases and likely supports other use cases we are not yet aware of. I really like how it maps the "generic" interface onto those cases.

I have this old "rule of thumb": you need at least three use cases to be able to design a truly reusable infrastructure API/component. Here we have three use cases it will serve :)

Jl

On Sat, Mar 21, 2026 at 8:44 PM Vikram Koka via dev <[email protected]> wrote:

> Dear Airflowers,
>
> Over the last several months, there have been a lot of discussions in the
> devlist around improvements needed for long running jobs outside of Airflow
> (raised by XD and others), and about improved event triggering (raised by
> Jake and others). XD, Jake, and I have gotten together and collaborated on
> a unified approach for Task State Management within Airflow which we would
> like to propose.
>
> Apache Airflow has been built around stateless, idempotent tasks, and this
> has served the community incredibly well. But as production AI and data
> workloads have grown more sophisticated, a clear gap has emerged that the
> community has been working around for a while.
>
> Three patterns keep coming up. An incremental operator needs to know where
> it left off last time, so it does not reprocess data it has already
> handled. An operator running a Databricks or EMR job needs to survive a
> worker disruption without cancelling a job that was 90% complete and
> starting over from scratch. A long-running async task processing thousands
> of files needs to checkpoint its progress so a retry picks up where it left
> off, not from the beginning.
>
> All three patterns are forcing users into the same workarounds today,
> generally bending XCom beyond its intended purpose, or building their own
> state persistence outside of Airflow entirely.
>
> We think we can do better. AIP-XX: Task State Management is a new
> foundation AIP that addresses all three patterns through a single, minimal,
> pluggable framework. Built on top of the Execution API from AIP-72, with
> full async support consistent with AIP-98, Task State is deliberately and
> cleanly separate from XCom, with different scoping, different lifecycle
> semantics, and different garbage collection mechanics. It also provides the
> foundation for a simplified AIP-93 (Asset Watermarking)
> <https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-93+Asset+Watermarks+and+State+Variables>
> and for long running remote operations using either the AIP-tbd Persistent
> Parameter for Airflow Operators
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>
> or AIP-96 (Resumable Operators)
> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-96+Resumable+Operators>.
>
> Full draft is on Confluence as Draft AIP-xx: Task State Management
> <https://cwiki.apache.org/confluence/display/AIRFLOW/Draft%3A+AIP-xx%3A+Task+State+Management>
>
> We would love to hear your thoughts. Please comment on the AIP doc.
>
> Best regards,
> Vikram, XD, and Jake
> --
>
> Vikram Koka
> Chief Strategy Officer
> Email: [email protected]
>
> <https://www.astronomer.io/>
