Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Xiaodong Deng Tue, 18 Nov 2025 14:21:46 -0800

Thanks for your valuable feedback, folks.

Hi @TP,


There are cases where breaking down to multiple tasks is not feasible or not 
the best option. For example, the use case 1 I have shared in the Confluence 
doc appendix.

There are also examples where splitting into multiple tasks may seem make sense 
but may cause down-side effect. In use case 2 and 4 in the Confluence doc 
appendix, I shared why we do it in a single task instead of splitting them into 
two tasks.

Some tasks are simply atomic.


Hi @Jarek,

I'm glad we are talking about idempotency. That's exactly why sometimes we 
cannot break down some tasks. In the "Problem Examples" section in the 
Confluence doc, I covered that at some extent.

Would love to discuss more on this, or learn from you for any alternative 
solutions which can become available to Airflow users in a timely manner.

Many thanks!


Regards,
XD

On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> I agree with TP wholeheartedly. The basic reason why XCom is deleted when
> restarting is to maintain idempotency principles. And if we allow XCom to
> be used to break idempotency (that's basically what state per task is
> about) - then XCom will stop serving its purpose.
> 
> And of course - we are in the new "world" where we are not only supporting
> idempotent tasks, Various optimisations and different kinds of workloads
> require breaking the "old" idempotency rules we used to have when Airflow
> was used mainly for ETL. And deletion of XCom state was also questioned
> back then because people **wanted** to use Xcom in other ways. But we held
> strongly and I think that was a good choice.
> 
> And while repurposing XCom to do "something" else might seem like a good
> idea - even for Apple, because they could internally agree to some
> convention and use it as "solution". But when you look at Airflow as a
> product, repurposing XCome to also do something else (i.e. storing state)
> seems a bit "lazy" and "short-cut-y".
> 
> What does it save if you do it this way? Few things:
> 
> * not having to do database migration to implement new feature
> * avoiding having a clearly defined API where state can be stored for
> various purposes on different levels (Task Instance, Task, Task Group
> maybe, Dag, Team eventually)
> * avoiding to think and prepare for all the various use cases that people
> really would like to use it
> * avoiding to write the use-case documentation explaining how you can use
> state
> * avoiding to write all the test cases making sure that all those use cases
> are served way
> * not thinking too much about performance and security implications of
> those ("Xcom has it already sorted out, I am sure it's going to be fine")
> 
> Yes, it can be done way faster this way. and I understand some commercial
> users could have chosen this way as a shortcut to handle a specific use
> case they had in mind. This is absolutely understandable, and this is what
> I would even expect a for-profit company to do to increase so-called
> "time-to-market" and start reaping the benefits of it faster.
> 
> But should we do it in Airflow the same way ? We are not a for-profit
> company, time-to-market of such a feature is secondary, compared to the
> stability, maintainability and having a "product" vision.
> I consider all the above points as absolutely crucial properties of a
> "product" - which Airflow is. They might not be needed in a "solution", but
> having a good "product" - absolutely requires all those things,
> 
> When we switched to Airflow 3, one of the ideas was to remove all the bad
> "solution-y" decisions we made in the past that slowed us down in general
> and - more importantly - turned us into (as Daniel used to say) into
> "back-compatibility engineers"
> 
> Does it mean it will take longer and require more dedication and effort
> and discussions to agree on the scope ? Absolutely. Is this a bad thing? I
> don't think so.
> 
> J.
> 
> 
> On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> [email protected]> wrote:
> 
> > What is the motivation behind storing internal state in a task, instead of
> > splitting the logic on state boundaries into multiple tasks? That’s what
> > the task abstraction is supposed for, and you wouldn’t need to a separate
> > mechanism for that—regular XCom would just work.
> >
> > While storing state is a legitimate use case, I feel this particular idea
> > would have a more negative impact on encouraging people to do too many
> > things in one task. I’d even argue the examples given in the Confluence
> > document are already so.
> >
> > TP
> >
> >
> > > On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > >
> > > Hi folks!
> > >
> > > We would like to propose a new feature in Airflow, a boolean
> > > parameter  "persist_xcom_through_retry" Parameter in all Airflow
> > Operators.
> > > Our team added this feature in our internal fork a few years back, and it
> > > has been benefiting our users extensively.
> > >
> > > *I have created an AIP
> > > at
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333
> > > <
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333
> > >*.
> > > Below is a summary (in the complete AIP, we have a more detailed problem
> > > statement and quite a few interesting use-case examples):
> > >
> > >
> > >
> > >
> > > *Traditionally, XCom is defined as “a mechanism that lets Tasks talk to
> > > each other”. However, XCom also has the capacity and potential to help
> > > persist and manage task state within a task itself.Currently, Apache
> > > Airflow automatically clears a task instance’s XCom data when it is
> > > retried. This behavior, while ensuring clean state for retry attempts,
> > > creates limitations:*
> > >
> > >   - *Loss of Internal Progress: Tasks that have internal checkpointing or
> > >   progress tracking lose all intermediate state on retry, forcing restart
> > >   from the beginning.*
> > >   - *Resource State Loss: Tasks cannot maintain state about allocated
> > >   resources (compute instances, downstream job IDs, etc.) across retry
> > >   attempts, leading to redundant expensive setup operations.*
> > >   - *No Recovery/Resume Capability: There's no way for tasks to resume
> > >   from internal checkpoints when transient failures occur during
> > >   long-running atomicoperations.*
> > >   - *Poor User Experience: users must implement external state management
> > >   systems to work around this limitation, adding complexity to DAG
> > authoring.*
> > >
> > >
> > > *This proposal aims at extending the capacity of XCom by allowing
> > > persisting a Task Instance’s XCom through its retries, enabling users to
> > > build more resilient and efficient pipelines. This is particularly useful
> > > for the type of tasks which are atomic (so one such task cannot be split
> > > into multiple tasks) and need to manage internal state or checkpoints. *
> > >
> > >
> > > We look forward to your feedback and thoughts. Thanks!
> > >
> > >
> > > Regards,
> > >
> > > XD
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Reply via email to