Dear Airflowers,

Over the last several months, there have been many discussions on the
devlist about improvements needed for long-running jobs outside of Airflow
(raised by XD and others), and about improved event triggering (raised by
Jake and others). XD, Jake, and I have collaborated on a unified approach
to Task State Management within Airflow, which we would like to propose.

Apache Airflow has been built around stateless, idempotent tasks, and this
has served the community incredibly well. But as production AI and data
workloads have grown more sophisticated, a clear gap has emerged that the
community has been working around for a while.

Three patterns keep coming up. An incremental operator needs to know where
it left off last time, so it does not reprocess data it has already
handled. An operator running a Databricks or EMR job needs to survive a
worker disruption without cancelling a job that was 90% complete and
starting over from scratch. A long-running async task processing thousands
of files needs to checkpoint its progress so a retry picks up where it left
off, not from the beginning.
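
To make the third pattern concrete, here is a minimal Python sketch of
checkpointed file processing. It is an illustration only: InMemoryTaskState,
process_files, and the (dag_id, task_id, key) tuple are hypothetical names
chosen for this example and are not the API or scoping proposed in the AIP.
A real backend would persist state outside the worker process.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical scoping for illustration: (dag_id, task_id, state_key).
StateKey = Tuple[str, str, str]


class InMemoryTaskState:
    """Stand-in for a pluggable state backend (hypothetical, not an Airflow API)."""

    def __init__(self) -> None:
        self._data: Dict[StateKey, str] = {}

    def get(self, key: StateKey) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: StateKey, value: str) -> None:
        self._data[key] = value


def process_files(
    files: Iterable[str],
    state: InMemoryTaskState,
    key: StateKey,
    handle: Callable[[str], None],
) -> List[str]:
    """Process files in sorted order, recording a checkpoint after each one.

    On a retry, files at or before the stored checkpoint are skipped, so
    work resumes where the previous attempt stopped.
    """
    last_done = state.get(key)
    processed: List[str] = []
    for name in sorted(files):
        if last_done is not None and name <= last_done:
            continue  # already handled by an earlier attempt
        handle(name)
        state.set(key, name)  # checkpoint only after the file succeeds
        processed.append(name)
    return processed
```

If the handler raises partway through, the stored checkpoint still points at
the last file that completed, so calling process_files again with the same
state and key skips the finished files instead of starting over.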

All three patterns force users into the same workarounds today: generally
bending XCom beyond its intended purpose, or building their own state
persistence outside of Airflow entirely.

We think we can do better. AIP-XX: Task State Management is a new
foundation AIP that addresses all three patterns through a single, minimal,
pluggable framework. Built on top of the Execution API from AIP-72, with
full async support consistent with AIP-98, Task State is deliberately and
cleanly separate from XCom, with different scoping, different lifecycle
semantics, and different garbage collection mechanics. It also provides the
foundation for a simplified AIP-93 (Asset Watermarking)
<https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-93+Asset+Watermarks+and+State+Variables>
and for long-running remote operations using either the AIP-tbd Persistent
Parameter for Airflow Operators
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>
or AIP-96 (Resumable Operators)
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-96+Resumable+Operators>.

The full draft is on Confluence as Draft AIP-xx: Task State Management
<https://cwiki.apache.org/confluence/display/AIRFLOW/Draft%3A+AIP-xx%3A+Task+State+Management>

We would love to hear your thoughts. Please comment on the AIP doc.

Best regards,
Vikram, XD, and Jake
-- 

Vikram Koka
Chief Strategy Officer
Email: [email protected]


<https://www.astronomer.io/>
