Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Jens Scheffler Tue, 18 Nov 2025 22:29:00 -0800

Hi,

would add (6) as use case as I made it in the Confluence as comment andTP highlighted: Add try number and keep history for seeing differencesbetween runs (as admin for sanity check/history after dag code was e.g.patched - might be a dowstream task was not re-run and was depending onan older XCom ... so that would help in case of troubleshooting.

But in (6) NOT as to have logic based on try_number as this would beanother purpose in my view.

So in this case I think the discussion is valuable and some extension inall the listed use cases makes sense to me!


Jens

On 11/19/25 07:07, Tzu-ping Chung via dev wrote:

What I feel is, while it is fine to have more than one way to do a thing, some 
of the examples do not sufficiently discuss why existing features are not 
suitable for the use case. This context is important since it would affect how 
we implement the new feature to sufficiently distinguish it from existing ones, 
so it is easier to make the correct decision when you are choosing between 
features to achieve a goal. It is also a good chance for us to take a look at 
enhancing other existing features so they cover more use cases and work better 
together.

I’ll try to break down each use case in the appendix. To be clear, I can think 
of some possibilities for each case why a new feature is preferred, but the 
problem is the document should sufficiently explore and discuss existing 
solutions.

1: Large Dataset Processing with Checkpoints

It is unclear from the example how the use case cannot be satisfied by dynamic 
task mapping:

     @task
     def process_record(record): ...

     @task(trigger_rule="always")
     def summary(results): ...

     results = process_record.expand(record=get_records_to_process())
     summary(results)

2: External Job Tracking and Polling

This looks like a use case for sensors to me.

3: More Efficient API Integration

Why does make_api_calls need to be in the same task? All existing patterns in 
Airflow  point to making it a separate task.

4: Resource Management and Cleanup

Isn’t this what teardown tasks are for?

5: Adaptive Processing with Learning

This is the use case that I feel the proposal is most useful for. However, it 
can also be satisfied by Variable, or the state persistence mechanism mentioned 
by Ash.

In some ways, the three are really the same thing—a way to keep context—except 
they have different scopes. Variable has the global (to the Airflow instance) 
scope, XCom the task runner process scope (almost task instance scope but not 
quite since it’s cleared for retry). StateVariable is also global as currently 
proposed, but from the listed use cases, it is arguably more suitable to be 
task- or dag-scoped (not to be confused to being scoped to a task instance or 
dag run).

Back to the proposal at hand, the way I understand persist_xcom_through_retry 
is it essentially switches all XComs pushed in the task from being scoped by 
the task instance *try* to the task instance (across all tries). I think the 
idea itself is worth having, and having a task-level flag may be a good way to 
expose it to users. However, I feel there are some choices we can still discuss 
on what the feature actually means beyond having a flag that does one specific 
thing internally.

For example, perhaps we should remodel XComModel to include a try_number, and 
allow it to be scoped both against a ti or a ti try? Potentially even more 
choices such as task-scoped across runs, or globally by unifying Variable? 
There are many open questions from my point of view, and again, I feel the 
proposal document should discuss the use cases in more detail to pin down the 
specifics, instead of leaving things out for interpretation.

TP

On 19 Nov 2025, at 06:20, Xiaodong Deng <[email protected]> wrote:

Thanks for your valuable feedback, folks.

Hi @TP,

There are cases where breaking down to multiple tasks is not feasible or not 
the best option. For example, the use case 1 I have shared in the Confluence 
doc appendix.

There are also examples where splitting into multiple tasks may seem make sense 
but may cause down-side effect. In use case 2 and 4 in the Confluence doc 
appendix, I shared why we do it in a single task instead of splitting them into 
two tasks.

Some tasks are simply atomic.


Hi @Jarek,

I'm glad we are talking about idempotency. That's exactly why sometimes we cannot break 
down some tasks. In the "Problem Examples" section in the Confluence doc, I 
covered that at some extent.

Would love to discuss more on this, or learn from you for any alternative 
solutions which can become available to Airflow users in a timely manner.

Many thanks!


Regards,
XD

On 2025/11/16 09:48:10 Jarek Potiuk wrote:

I agree with TP wholeheartedly. The basic reason why XCom is deleted when
restarting is to maintain idempotency principles. And if we allow XCom to
be used to break idempotency (that's basically what state per task is
about) - then XCom will stop serving its purpose.

And of course - we are in the new "world" where we are not only supporting
idempotent tasks, Various optimisations and different kinds of workloads
require breaking the "old" idempotency rules we used to have when Airflow
was used mainly for ETL. And deletion of XCom state was also questioned
back then because people **wanted** to use Xcom in other ways. But we held
strongly and I think that was a good choice.

And while repurposing XCom to do "something" else might seem like a good
idea - even for Apple, because they could internally agree to some
convention and use it as "solution". But when you look at Airflow as a
product, repurposing XCome to also do something else (i.e. storing state)
seems a bit "lazy" and "short-cut-y".

What does it save if you do it this way? Few things:

* not having to do database migration to implement new feature
* avoiding having a clearly defined API where state can be stored for
various purposes on different levels (Task Instance, Task, Task Group
maybe, Dag, Team eventually)
* avoiding to think and prepare for all the various use cases that people
really would like to use it
* avoiding to write the use-case documentation explaining how you can use
state
* avoiding to write all the test cases making sure that all those use cases
are served way
* not thinking too much about performance and security implications of
those ("Xcom has it already sorted out, I am sure it's going to be fine")

Yes, it can be done way faster this way. and I understand some commercial
users could have chosen this way as a shortcut to handle a specific use
case they had in mind. This is absolutely understandable, and this is what
I would even expect a for-profit company to do to increase so-called
"time-to-market" and start reaping the benefits of it faster.

But should we do it in Airflow the same way ? We are not a for-profit
company, time-to-market of such a feature is secondary, compared to the
stability, maintainability and having a "product" vision.
I consider all the above points as absolutely crucial properties of a
"product" - which Airflow is. They might not be needed in a "solution", but
having a good "product" - absolutely requires all those things,

When we switched to Airflow 3, one of the ideas was to remove all the bad
"solution-y" decisions we made in the past that slowed us down in general
and - more importantly - turned us into (as Daniel used to say) into
"back-compatibility engineers"

Does it mean it will take longer and require more dedication and effort
and discussions to agree on the scope ? Absolutely. Is this a bad thing? I
don't think so.

J.


On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
[email protected]> wrote:

What is the motivation behind storing internal state in a task, instead of
splitting the logic on state boundaries into multiple tasks? That’s what
the task abstraction is supposed for, and you wouldn’t need to a separate
mechanism for that—regular XCom would just work.

While storing state is a legitimate use case, I feel this particular idea
would have a more negative impact on encouraging people to do too many
things in one task. I’d even argue the examples given in the Confluence
document are already so.

TP

On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:

Hi folks!

We would like to propose a new feature in Airflow, a boolean
parameter  "persist_xcom_through_retry" Parameter in all Airflow

Operators.

Our team added this feature in our internal fork a few years back, and it
has been benefiting our users extensively.

*I have created an AIP
at

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333

*.
Below is a summary (in the complete AIP, we have a more detailed problem
statement and quite a few interesting use-case examples):




*Traditionally, XCom is defined as “a mechanism that lets Tasks talk to
each other”. However, XCom also has the capacity and potential to help
persist and manage task state within a task itself.Currently, Apache
Airflow automatically clears a task instance’s XCom data when it is
retried. This behavior, while ensuring clean state for retry attempts,
creates limitations:*

  - *Loss of Internal Progress: Tasks that have internal checkpointing or
  progress tracking lose all intermediate state on retry, forcing restart
  from the beginning.*
  - *Resource State Loss: Tasks cannot maintain state about allocated
  resources (compute instances, downstream job IDs, etc.) across retry
  attempts, leading to redundant expensive setup operations.*
  - *No Recovery/Resume Capability: There's no way for tasks to resume
  from internal checkpoints when transient failures occur during
  long-running atomicoperations.*
  - *Poor User Experience: users must implement external state management
  systems to work around this limitation, adding complexity to DAG

authoring.*


*This proposal aims at extending the capacity of XCom by allowing
persisting a Task Instance’s XCom through its retries, enabling users to
build more resilient and efficient pipelines. This is particularly useful
for the type of tasks which are atomic (so one such task cannot be split
into multiple tasks) and need to manage internal state or checkpoints. *


We look forward to your feedback and thoughts. Thanks!


Regards,

XD


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Reply via email to