Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Daniel Standish via dev Tue, 18 Nov 2025 15:03:15 -0800

I think that things like idempotency and "too big tasks" are wielded too
often as arguments against features.


For me the questions I think about are the following:
* is this a good interface?
* is this too complicated / confusing / hard to understand?
* is this complexity that isn't merited by the value of the feature?

There are a lot of features in Airflow that bear some similarities to this
one, which, based on these considerations, I *do not* really like.

One example is *wait_for_past_depends_before_skipping*.  When I look at
this, I say, what the hell does this mean?

Another is Pool.include_deferred.

Everyone may have different ones, but anyway you probably get the point.
I'm not sure how I feel about this one exactly.

*BUT*

I want to suggest that "idempotency" and "task bigness" are not really what
we should be thinking about here.

Re:

> The basic reason why XCom is deleted when restarting is to maintain
> idempotency principles


*It doesn't break it if you don't use it.*

And there is absolutely no guarantee that tasks in Airflow are idempotent;
it is 100% up to the user, who provides the operator params, whether
the task will be idempotent.  We give you the tools to *solve your real
world problems*.  And it's up to *you* to do it in a way that works for
*you*.

There is a vast world of tasks that Airflow does and should support that
either are not idempotent or, perhaps more precisely, for which idempotence
isn't really well-defined.  E.g. is an incremental insert-update process
idempotent?  It doesn't do quite the same thing every time.  Yet, there are
no negative consequences of running it over and over.
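
To make the insert-update example concrete, here is a minimal sketch.
Everything in it is made up for illustration (the `source` and `target`
tables, the `incremental_upsert` helper, the `updated_at` watermark), and it
uses plain sqlite3 rather than anything Airflow-specific. Each run does
different work, yet re-running it is harmless:

```python
import sqlite3


def incremental_upsert(conn):
    """Copy rows newer than the target's high-water mark, replacing by primary key."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM target"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, value, updated_at FROM source WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE overwrites an existing row that has the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO target (id, value, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER);
    CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER);
    INSERT INTO source VALUES (1, 'a', 10), (2, 'b', 20);
    """
)

first = incremental_upsert(conn)    # copies both existing rows
second = incremental_upsert(conn)   # nothing new, so nothing is copied

# New and changed data arrives upstream between runs.
conn.execute("INSERT INTO source VALUES (3, 'c', 30)")
conn.execute("UPDATE source SET value = 'a2', updated_at = 40 WHERE id = 1")
third = incremental_upsert(conn)    # copies only the new and the changed row
```

The second run finds nothing to copy, so whether you call this "idempotent"
is mostly a matter of definition.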

So to me, we add the feature based on its usefulness and the adequacy of
its design only.

Sure, if it somehow *interfered* with users' ability to write "idempotent"
pipelines, then that's a valid consideration.  But it doesn't because it's
an opt-in feature.
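
As a rough illustration of what opt-in means here, the toy retry loop below
is a simulation only, not Airflow code: `run_with_retries`, `make_task` and
the `state` dict (standing in for a task instance's own XCom) are
hypothetical, and the flag merely borrows the name of the proposed parameter.
With the default, every attempt starts clean, as it does today; only a task
that sets the flag resumes from its checkpoint.

```python
class TransientError(Exception):
    """Simulated recoverable failure."""


def run_with_retries(task, retries, persist_xcom_through_retry=False):
    """Toy retry loop; `state` stands in for the task instance's own XCom."""
    state = {}
    for attempt in range(1, retries + 2):
        if not persist_xcom_through_retry:
            state.clear()  # default behaviour: each attempt starts clean
        try:
            return task(state, attempt)
        except TransientError:
            if attempt == retries + 1:
                raise  # out of retries, surface the failure


def make_task(log, chunks=5):
    """Build a task that checkpoints after each chunk and fails once on attempt 1."""

    def task(state, attempt):
        for chunk in range(state.get("next_chunk", 0), chunks):
            log.append((attempt, chunk))     # record the work actually done
            state["next_chunk"] = chunk + 1  # checkpoint after the chunk
            if attempt == 1 and chunk == 2:
                raise TransientError("simulated transient failure")
        return "done"

    return task


default_log, opt_in_log = [], []
default_result = run_with_retries(make_task(default_log), retries=1)
opt_in_result = run_with_retries(
    make_task(opt_in_log), retries=1, persist_xcom_through_retry=True
)
```

Both runs finish; the default one redoes chunks 0 to 2 on the retry, while
the opt-in one picks up at chunk 3. A task that never sets the flag sees no
change in behaviour.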




On Tue, Nov 18, 2025 at 2:46 PM Jarek Potiuk <[email protected]> wrote:

> Proposed Alternative:
>
> Complete and propose a regular "state" storage proposal. There were plenty
> of discussions about that, including the Asset Watermarks that Ash mentioned.
> I think the best way is to lead that discussion to completion and, as a
> result, come up with state management that can be used in this case as
> well.
>
> As mentioned in my previous mail, my thinking is that we are not in a
> "time-to-market" game. We are more in a "deliver a good product" game.  If it
> takes more time, so be it, but let's do it properly. There is not much to
> lose by having it later, but there is a lot to lose collectively if our
> users start misusing a half-baked feature that misleads them into doing
> something we do not want them to do.
>
> J.
>
>
> On Tue, Nov 18, 2025 at 11:25 PM Xiaodong Deng <[email protected]> wrote:
>
> > In addition, I understand we would like to stick to certain
> > design/principles. However, if that is blocking certain reasonable use
> > cases, either alternative solutions need to be provided or "principles"
> > need to be adjusted.
> >
> > That's what I'm hoping for here.
> >
> > Thanks again!
> >
> >
> > Regards,
> > XD
> >
> > On 2025/11/18 22:20:36 Xiaodong Deng wrote:
> > > Thanks for your valuable feedback, folks.
> > >
> > > Hi @TP,
> > >
> > > There are cases where breaking a task down into multiple tasks is not
> > > feasible or not the best option. For example, use case 1, which I shared
> > > in the Confluence doc appendix.
> > >
> > > There are also examples where splitting into multiple tasks may seem to
> > > make sense but may have downsides. In use cases 2 and 4 in the
> > > Confluence doc appendix, I shared why we do it in a single task instead
> > > of splitting them into two tasks.
> > >
> > > Some tasks are simply atomic.
> > >
> > >
> > > Hi @Jarek,
> > >
> > > I'm glad we are talking about idempotency. That's exactly why sometimes
> > > we cannot break down some tasks. In the "Problem Examples" section in the
> > > Confluence doc, I covered that to some extent.
> > >
> > > Would love to discuss this more, or learn from you about any
> > > alternative solutions which can become available to Airflow users in a
> > > timely manner.
> > >
> > > Many thanks!
> > >
> > >
> > > Regards,
> > > XD
> > >
> > > On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > > > I agree with TP wholeheartedly. The basic reason why XCom is deleted
> > > > when restarting is to maintain idempotency principles. And if we allow
> > > > XCom to be used to break idempotency (that's basically what state per
> > > > task is about), then XCom will stop serving its purpose.
> > > >
> > > > And of course, we are in the new "world" where we are not only
> > > > supporting idempotent tasks. Various optimisations and different kinds
> > > > of workloads require breaking the "old" idempotency rules we used to
> > > > have when Airflow was used mainly for ETL. And deletion of XCom state
> > > > was also questioned back then because people **wanted** to use XCom in
> > > > other ways. But we held strongly and I think that was a good choice.
> > > >
> > > > And repurposing XCom to do "something" else might seem like a good
> > > > idea, even for Apple, because they could internally agree to some
> > > > convention and use it as a "solution". But when you look at Airflow as
> > > > a product, repurposing XCom to also do something else (i.e. storing
> > > > state) seems a bit "lazy" and "short-cut-y".
> > > >
> > > > What does it save if you do it this way? A few things:
> > > >
> > > > * not having to do a database migration to implement a new feature
> > > > * avoiding having a clearly defined API where state can be stored for
> > > > various purposes on different levels (Task Instance, Task, Task Group
> > > > maybe, Dag, Team eventually)
> > > > * avoiding having to think about and prepare for all the various use
> > > > cases that people really would like to use it for
> > > > * avoiding writing the use-case documentation explaining how you can
> > > > use state
> > > > * avoiding writing all the test cases making sure that all those use
> > > > cases are properly served
> > > > * not thinking too much about the performance and security implications
> > > > of those ("XCom has it already sorted out, I am sure it's going to be
> > > > fine")
> > > >
> > > > Yes, it can be done way faster this way, and I understand some
> > > > commercial users could have chosen this way as a shortcut to handle a
> > > > specific use case they had in mind. This is absolutely understandable,
> > > > and this is what I would even expect a for-profit company to do to
> > > > shorten the so-called "time-to-market" and start reaping the benefits
> > > > of it faster.
> > > >
> > > > But should we do it in Airflow the same way? We are not a for-profit
> > > > company; time-to-market of such a feature is secondary compared to
> > > > stability, maintainability and having a "product" vision.
> > > > I consider all the above points absolutely crucial properties of a
> > > > "product", which Airflow is. They might not be needed in a "solution",
> > > > but having a good "product" absolutely requires all those things.
> > > >
> > > > When we switched to Airflow 3, one of the ideas was to remove all the
> > > > bad "solution-y" decisions we made in the past that slowed us down in
> > > > general and, more importantly, turned us into (as Daniel used to say)
> > > > "back-compatibility engineers".
> > > >
> > > > Does it mean it will take longer and require more dedication, effort
> > > > and discussions to agree on the scope? Absolutely. Is this a bad
> > > > thing? I don't think so.
> > > >
> > > > J.
> > > >
> > > >
> > > > On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> > > > [email protected]> wrote:
> > > >
> > > > > What is the motivation behind storing internal state in a task,
> > > > > instead of splitting the logic on state boundaries into multiple
> > > > > tasks? That’s what the task abstraction is meant for, and you
> > > > > wouldn’t need a separate mechanism for that; regular XCom would just
> > > > > work.
> > > > >
> > > > > While storing state is a legitimate use case, I feel this particular
> > > > > idea would have a more negative impact by encouraging people to do
> > > > > too many things in one task. I’d even argue the examples given in the
> > > > > Confluence document already do so.
> > > > >
> > > > > TP
> > > > >
> > > > >
> > > > > > On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > > > > >
> > > > > > Hi folks!
> > > > > >
> > > > > > We would like to propose a new feature in Airflow: a boolean
> > > > > > parameter, "persist_xcom_through_retry", in all Airflow Operators.
> > > > > > Our team added this feature in our internal fork a few years back,
> > > > > > and it has been benefiting our users extensively.
> > > > > >
> > > > > > *I have created an AIP at
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333*.
> > > > > > Below is a summary (in the complete AIP, we have a more detailed
> > > > > > problem statement and quite a few interesting use-case examples):
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > *Traditionally, XCom is defined as “a mechanism that lets Tasks talk
> > > > > > to each other”. However, XCom also has the capacity and potential to
> > > > > > help persist and manage task state within a task itself. Currently,
> > > > > > Apache Airflow automatically clears a task instance’s XCom data when
> > > > > > it is retried. This behavior, while ensuring clean state for retry
> > > > > > attempts, creates limitations:*
> > > > > >
> > > > > >   - *Loss of Internal Progress: Tasks that have internal
> > > > > >   checkpointing or progress tracking lose all intermediate state on
> > > > > >   retry, forcing a restart from the beginning.*
> > > > > >   - *Resource State Loss: Tasks cannot maintain state about
> > > > > >   allocated resources (compute instances, downstream job IDs, etc.)
> > > > > >   across retry attempts, leading to redundant expensive setup
> > > > > >   operations.*
> > > > > >   - *No Recovery/Resume Capability: There's no way for tasks to
> > > > > >   resume from internal checkpoints when transient failures occur
> > > > > >   during long-running atomic operations.*
> > > > > >   - *Poor User Experience: Users must implement external state
> > > > > >   management systems to work around this limitation, adding
> > > > > >   complexity to DAG authoring.*
> > > > > >
> > > > > >
> > > > > > *This proposal aims to extend the capacity of XCom by allowing a
> > > > > > Task Instance’s XCom to persist through its retries, enabling users
> > > > > > to build more resilient and efficient pipelines. This is particularly
> > > > > > useful for tasks which are atomic (so one such task cannot be split
> > > > > > into multiple tasks) and need to manage internal state or
> > > > > > checkpoints.*
> > > > > >
> > > > > >
> > > > > > We look forward to your feedback and thoughts. Thanks!
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > XD
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
