Thanks Vikram, Jake, XD also from my side!

A big +1 for moving this forward; I think this is really important. From reading over it, though, I do not see why it is marked as DRAFT, because besides nits I think it is already very mature. Everything I saw is in general "right". So I hope this will not be a really controversial discussion and we can get this into 3.3!

(Some could say this concept is overdue... but it is important to have!)

Jens

On 21.03.26 20:58, Jarek Potiuk wrote:
Thanks Vikram,

This is a crucial AIP for Airflow 3.3+. I skimmed through it and will
provide more comments over the coming days, but it very much looks like
what I imagined for state management in Airflow.
It has about the right abstraction layer, focusing on building
infrastructure that serves the previously articulated use cases and
likely supports other use cases we are not yet aware of. I really like how
it maps the "generic" interface into those cases.

I have this old "rule of thumb": you need at least three use cases to be
able to design a truly reusable infrastructure API/component. Here we
have 3 use cases it will serve :)

Jl


On Sat, Mar 21, 2026 at 8:44 PM Vikram Koka via dev <[email protected]>
wrote:

Dear Airflowers,

Over the last several months, there have been a lot of discussions in the
devlist around improvements needed for long running jobs outside of Airflow
(raised by XD and others), and about improved event triggering (raised by
Jake and others). XD, Jake, and I have gotten together and collaborated on
a unified approach for Task State Management within Airflow which we would
like to propose.

Apache Airflow has been built around stateless, idempotent tasks, and this
has served the community incredibly well. But as production AI and data
workloads have grown more sophisticated, a clear gap has emerged that the
community has been working around for a while.

Three patterns keep coming up. An incremental operator needs to know where
it left off last time, so it does not reprocess data it has already
handled. An operator running a Databricks or EMR job needs to survive a
worker disruption without cancelling a job that was 90% complete and
starting over from scratch. A long-running async task processing thousands
of files needs to checkpoint its progress so a retry picks up where it left
off, not from the beginning.

All three patterns are forcing users into the same workarounds today:
generally bending XCom beyond its intended purpose, or building their own
state persistence outside of Airflow entirely.

We think we can do better. AIP-XX: Task State Management is a new
foundation AIP that addresses all three patterns through a single, minimal,
pluggable framework. Built on top of the Execution API from AIP-72, with
full async support consistent with AIP-98, Task State is deliberately and
cleanly separate from XCom, with different scoping, different lifecycle
semantics, and different garbage collection mechanics. It also provides the
foundation for a simplified AIP-93 (Asset Watermarking)
<https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-93+Asset+Watermarks+and+State+Variables>
and for long running remote operations using either the AIP-tbd Persistent
Parameter for Airflow Operators
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>
or AIP-96 (Resumable Operators)
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-96+Resumable+Operators>.

Full draft is on Confluence as Draft AIP-xx: Task State Management
<https://cwiki.apache.org/confluence/display/AIRFLOW/Draft%3A+AIP-xx%3A+Task+State+Management>.
We would love to hear your thoughts. Please comment on the AIP doc.

Best regards,
Vikram, XD, and Jake
--

Vikram Koka
Chief Strategy Officer
Email: [email protected]


<https://www.astronomer.io/>


