Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Xiaodong Deng Wed, 19 Nov 2025 13:24:22 -0800

Hi Jarek,

If you don't mind, I would suggest not framing the proposal as a 
"time-to-market" thing vs. a "good product". The team put thought and effort 
into it hoping for a "good product" too. It's a discussion and learning 
process for everyone, and "good"/"bad" is still somewhat subjective.


As a long-time member of this community, I'm more than sure you don't mean 
anything negative. But for folks who are newly engaged or in a similar 
situation, this may sound a little bit discouraging ;-)


Regards,
XD

On 2025/11/18 22:45:52 Jarek Potiuk wrote:
> Proposed Alternative:
> 
> Complete and propose a regular "state" storage proposal - there were plenty
> of discussions about that - including Asset Watermarks that Ash mentioned.
> I think the best way is to lead that discussion to completion and, as a
> result, come up with a state management mechanism that can be used in this
> case as well.
> 
> As mentioned in my previous mail, my thinking is that we are not in a
> "time-to-market" game. We are more in a "deliver a good product" game. If it
> takes more time, so be it, but let's do it properly. There is not much to
> lose by having it later, but there is a lot to lose collectively if our
> users start misusing a half-baked feature that misleads them into doing
> something we do not want them to do.
> 
> J.
> 
> 
> On Tue, Nov 18, 2025 at 11:25 PM Xiaodong Deng <[email protected]> wrote:
> 
> > In addition, I understand we would like to stick to certain
> > design/principles. However, if that is blocking certain reasonable use
> > cases, either alternative solutions need to be provided or "principles"
> > need to be adjusted.
> >
> > That's what I'm hoping for here.
> >
> > Thanks again!
> >
> >
> > Regards,
> > XD
> >
> > On 2025/11/18 22:20:36 Xiaodong Deng wrote:
> > > Thanks for your valuable feedback, folks.
> > >
> > > Hi @TP,
> > >
> > > There are cases where breaking down into multiple tasks is not feasible
> > > or not the best option. For example, use case 1, which I shared in the
> > > Confluence doc appendix.
> > >
> > > There are also examples where splitting into multiple tasks may seem to
> > > make sense but may cause negative side effects. In use cases 2 and 4 in
> > > the Confluence doc appendix, I shared why we do it in a single task
> > > instead of splitting it into two tasks.
> > >
> > > Some tasks are simply atomic.
> > >
> > >
> > > Hi @Jarek,
> > >
> > > I'm glad we are talking about idempotency. That's exactly why we
> > > sometimes cannot break down some tasks. In the "Problem Examples" section
> > > of the Confluence doc, I covered that to some extent.
> > >
> > > I would love to discuss this more, or learn from you about any
> > > alternative solutions that can become available to Airflow users in a
> > > timely manner.
> > >
> > > Many thanks!
> > >
> > >
> > > Regards,
> > > XD
> > >
> > > On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > > > I agree with TP wholeheartedly. The basic reason why XCom is deleted
> > > > when restarting is to maintain idempotency principles. And if we allow
> > > > XCom to be used to break idempotency (that's basically what state per
> > > > task is about), then XCom will stop serving its purpose.
> > > >
> > > > And of course, we are in the new "world" where we are not only
> > > > supporting idempotent tasks. Various optimisations and different kinds
> > > > of workloads require breaking the "old" idempotency rules we used to
> > > > have when Airflow was used mainly for ETL. And deletion of XCom state
> > > > was also questioned back then because people **wanted** to use XCom in
> > > > other ways. But we held firm and I think that was a good choice.
> > > >
> > > > Repurposing XCom to do "something" else might seem like a good idea,
> > > > even for Apple, because they could internally agree on some convention
> > > > and use it as a "solution". But when you look at Airflow as a product,
> > > > repurposing XCom to also do something else (i.e. storing state) seems
> > > > a bit "lazy" and "short-cut-y".
> > > >
> > > > What does it save if you do it this way? A few things:
> > > >
> > > > * not having to do a database migration to implement a new feature
> > > > * avoiding having a clearly defined API where state can be stored for
> > > > various purposes on different levels (Task Instance, Task, Task Group
> > > > maybe, Dag, Team eventually)
> > > > * avoiding having to think about and prepare for all the various use
> > > > cases that people really would like to use it for
> > > > * avoiding having to write the use-case documentation explaining how
> > > > you can use state
> > > > * avoiding having to write all the test cases making sure that all
> > > > those use cases are served
> > > > * not thinking too much about performance and security implications of
> > > > those ("XCom has it already sorted out, I am sure it's going to be
> > > > fine")
> > > >
> > > > Yes, it can be done way faster this way, and I understand some
> > > > commercial users could have chosen this way as a shortcut to handle a
> > > > specific use case they had in mind. This is absolutely understandable,
> > > > and this is what I would even expect a for-profit company to do to
> > > > improve the so-called "time-to-market" and start reaping the benefits
> > > > of it faster.
> > > >
> > > > But should we do it in Airflow the same way? We are not a for-profit
> > > > company; time-to-market of such a feature is secondary compared to
> > > > stability, maintainability and having a "product" vision.
> > > > I consider all the above points absolutely crucial properties of a
> > > > "product", which Airflow is. They might not be needed in a "solution",
> > > > but having a good "product" absolutely requires all those things.
> > > > When we switched to Airflow 3, one of the ideas was to remove all the
> > > > bad "solution-y" decisions we made in the past that slowed us down in
> > > > general and, more importantly, turned us (as Daniel used to say) into
> > > > "back-compatibility engineers".
> > > >
> > > > Does it mean it will take longer and require more dedication, effort
> > > > and discussion to agree on the scope? Absolutely. Is this a bad thing?
> > > > I don't think so.
> > > >
> > > > J.
> > > >
> > > >
> > > > On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> > > > [email protected]> wrote:
> > > >
> > > > > What is the motivation behind storing internal state in a task,
> > > > > instead of splitting the logic on state boundaries into multiple
> > > > > tasks? That’s what the task abstraction is meant for, and you
> > > > > wouldn’t need a separate mechanism for that; regular XCom would just
> > > > > work.
> > > > >
> > > > > While storing state is a legitimate use case, I feel this particular
> > > > > idea would have a net negative impact by encouraging people to do
> > > > > too many things in one task. I’d even argue the examples given in the
> > > > > Confluence document already do so.
> > > > >
> > > > > TP
> > > > >
> > > > >
> > > > > > On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > > > > >
> > > > > > Hi folks!
> > > > > >
> > > > > > We would like to propose a new feature in Airflow: a boolean
> > > > > > parameter, "persist_xcom_through_retry", on all Airflow Operators.
> > > > > > Our team added this feature in our internal fork a few years back,
> > > > > > and it has been benefiting our users extensively.
> > > > > >
> > > > > > I have created an AIP at
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333
> > > > > >
> > > > > > Below is a summary (the complete AIP has a more detailed problem
> > > > > > statement and quite a few interesting use-case examples):
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > *Traditionally, XCom is defined as “a mechanism that lets Tasks
> > > > > > talk to each other”. However, XCom also has the capacity and
> > > > > > potential to help persist and manage task state within a task
> > > > > > itself. Currently, Apache Airflow automatically clears a task
> > > > > > instance’s XCom data when it is retried. This behavior, while
> > > > > > ensuring clean state for retry attempts, creates limitations:*
> > > > > >
> > > > > >   - *Loss of Internal Progress: Tasks that have internal
> > > > > >   checkpointing or progress tracking lose all intermediate state
> > > > > >   on retry, forcing a restart from the beginning.*
> > > > > >   - *Resource State Loss: Tasks cannot maintain state about
> > > > > >   allocated resources (compute instances, downstream job IDs, etc.)
> > > > > >   across retry attempts, leading to redundant expensive setup
> > > > > >   operations.*
> > > > > >   - *No Recovery/Resume Capability: There's no way for tasks to
> > > > > >   resume from internal checkpoints when transient failures occur
> > > > > >   during long-running atomic operations.*
> > > > > >   - *Poor User Experience: Users must implement external state
> > > > > >   management systems to work around this limitation, adding
> > > > > >   complexity to DAG authoring.*
> > > > > >
> > > > > >
> > > > > > *This proposal aims to extend the capability of XCom by allowing a
> > > > > > Task Instance’s XCom to persist through its retries, enabling users
> > > > > > to build more resilient and efficient pipelines. This is
> > > > > > particularly useful for tasks which are atomic (so such a task
> > > > > > cannot be split into multiple tasks) and need to manage internal
> > > > > > state or checkpoints.*
> > > > > >
> > > > > >
> > > > > > We look forward to your feedback and thoughts. Thanks!
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > XD
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > > >
> > > > >
> > > >
> > >
> >
> 
