Dear Airflowers,

Over the last several months, there have been many discussions on the devlist about improvements needed for long-running jobs outside of Airflow (raised by XD and others) and about improved event triggering (raised by Jake and others). XD, Jake, and I have collaborated on a unified approach to Task State Management within Airflow, which we would like to propose.

Apache Airflow has been built around stateless, idempotent tasks, and this has served the community incredibly well. But as production AI and data workloads have grown more sophisticated, a clear gap has emerged that the community has been working around for a while.

Three patterns keep coming up. An incremental operator needs to know where it left off last time, so it does not reprocess data it has already handled. An operator running a Databricks or EMR job needs to survive a worker disruption without cancelling a job that was 90% complete and starting over from scratch. A long-running async task processing thousands of files needs to checkpoint its progress so a retry picks up where it left off, not from the beginning.

All three patterns force users into the same workarounds today: generally bending XCom beyond its intended purpose, or building their own state persistence outside of Airflow entirely. We think we can do better.

AIP-XX: Task State Management is a new foundational AIP that addresses all three patterns through a single, minimal, pluggable framework. Built on top of the Execution API from AIP-72, with full async support consistent with AIP-98, Task State is deliberately and cleanly separate from XCom, with different scoping, different lifecycle semantics, and different garbage collection mechanics.

It also provides the foundation for a simplified AIP-93 (Asset Watermarking) <https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-93+Asset+Watermarks+and+State+Variables> and for long-running remote operations using either the AIP-tbd Persistent Parameter for Airflow Operators <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333> or AIP-96 (Resumable Operators) <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-96+Resumable+Operators>.
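
To make the checkpointing pattern concrete, here is a minimal, purely illustrative sketch. It is not the API proposed in the AIP: the InMemoryTaskState class, its get/set methods, the "files_done" key, and the process_files helper are all hypothetical names invented for this example, and the store is a plain in-memory dict standing in for whatever pluggable backend the AIP ends up defining.

```python
from typing import Callable, Dict, List, Tuple


class InMemoryTaskState:
    """Hypothetical stand-in for a task-state backend (NOT the AIP's API)."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], object] = {}

    def get(self, task_key: str, name: str, default=None):
        return self._data.get((task_key, name), default)

    def set(self, task_key: str, name: str, value) -> None:
        self._data[(task_key, name)] = value


def process_files(
    files: List[str],
    state: InMemoryTaskState,
    task_key: str,
    handle: Callable[[str], None],
) -> int:
    """Process files in order, checkpointing after each one.

    On a retry, work resumes from the last recorded checkpoint
    instead of starting again from the first file.
    """
    start = state.get(task_key, "files_done", 0)
    for index in range(start, len(files)):
        handle(files[index])
        # Record progress only after the file was handled successfully.
        state.set(task_key, "files_done", index + 1)
    return state.get(task_key, "files_done", 0)


# Demo: the first attempt fails on the third file, the retry resumes there.
state = InMemoryTaskState()
files = ["a.csv", "b.csv", "c.csv", "d.csv"]
processed: List[str] = []
already_failed = set()


def flaky_handle(name: str) -> None:
    # Simulate a one-time worker disruption while handling c.csv.
    if name == "c.csv" and name not in already_failed:
        already_failed.add(name)
        raise RuntimeError("simulated worker disruption")
    processed.append(name)


first_attempt_failed = False
try:
    process_files(files, state, "demo_task", flaky_handle)
except RuntimeError:
    first_attempt_failed = True

# The retry picks up at c.csv; a.csv and b.csv are not reprocessed.
process_files(files, state, "demo_task", flaky_handle)
```

In this sketch each file is handled exactly once across both attempts, which is the behaviour users currently approximate with XCom or external storage.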

The full draft is on Confluence as Draft AIP-xx: Task State Management <https://cwiki.apache.org/confluence/display/AIRFLOW/Draft%3A+AIP-xx%3A+Task+State+Management>

We would love to hear your thoughts. Please comment on the AIP doc.

Best regards,
Vikram, XD, and Jake

--
Vikram Koka
Chief Strategy Officer
Email: [email protected]
<https://www.astronomer.io/>
