XD,

I'd be happy to chat with you about some of the work that I've been doing
as part of AIP-93. We'd discussed breaking the AIP into two phases; the
first is a generic state-store that could be used like Variable and/or XCom
(but more like Variable). The second was to leverage that state-store to
provide a unified experience for Airflow users/developers interested in
building Asset "Watchers". Hit me up in Slack, and we can talk a bit more.

Thanks,
Jake

On Wed, Nov 19, 2025 at 4:24 PM Xiaodong Deng <[email protected]> wrote:

> Hi Jarek,
>
> If you don't mind, I would suggest not framing the proposal as a
> "time-to-market" thing vs. a "good product". The team put thought and effort
> into it, hoping for a "good product" too. It's a discussion and learning
> process for everyone, and "good"/"bad" is still somewhat subjective.
>
> As a long-time member of this community, I'm more than sure you don't mean
> anything negative. But for folks newly engaged or in a similar situation,
> this may sound a little bit discouraging ;-)
>
>
> Regards,
> XD
>
> On 2025/11/18 22:45:52 Jarek Potiuk wrote:
> > Proposed Alternative:
> >
> > Complete and propose a regular "state" storage proposal. There were plenty
> > of discussions about that, including the Asset Watermarks that Ash
> > mentioned. I think the best way is to lead that discussion to completion
> > and, as a result, come up with state management that can be used in this
> > case as well.
> >
> > As mentioned in my previous mail, my thinking is that we are not in a
> > "time-to-market" game. We are more in a "deliver a good product" game. If it
> > will take more time, so be it, but let's do it properly. There is not much to
> > lose by having it later, but there is a lot to lose collectively if our
> > users start misusing a half-baked feature that will mislead them into
> > doing something we do not want them to do.
> >
> > J.
> >
> >
> > On Tue, Nov 18, 2025 at 11:25 PM Xiaodong Deng <[email protected]> wrote:
> >
> > > In addition, I understand we would like to stick to certain
> > > design/principles. However, if that is blocking certain reasonable use
> > > cases, either alternative solutions need to be provided or "principles"
> > > need to be adjusted.
> > >
> > > That's what I'm hoping for here.
> > >
> > > Thanks again!
> > >
> > >
> > > Regards,
> > > XD
> > >
> > > On 2025/11/18 22:20:36 Xiaodong Deng wrote:
> > > > Thanks for your valuable feedback, folks.
> > > >
> > > > Hi @TP,
> > > >
> > > > There are cases where breaking a task down into multiple tasks is not
> > > > feasible or not the best option. For example, use case 1, which I
> > > > shared in the Confluence doc appendix.
> > > >
> > > > There are also examples where splitting into multiple tasks may seem to
> > > > make sense but may cause side effects. In use cases 2 and 4 in the
> > > > Confluence doc appendix, I shared why we do it in a single task instead
> > > > of splitting it into two tasks.
> > > >
> > > > Some tasks are simply atomic.
> > > >
> > > >
> > > > Hi @Jarek,
> > > >
> > > > I'm glad we are talking about idempotency. That's exactly why we
> > > > sometimes cannot break down some tasks. In the "Problem Examples"
> > > > section of the Confluence doc, I covered that to some extent.
> > > >
> > > > I would love to discuss this more, or learn from you about any
> > > > alternative solutions that can become available to Airflow users in a
> > > > timely manner.
> > > >
> > > > Many thanks!
> > > >
> > > >
> > > > Regards,
> > > > XD
> > > >
> > > > On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > > > > I agree with TP wholeheartedly. The basic reason why XCom is deleted
> > > > > when restarting is to maintain idempotency principles. And if we allow
> > > > > XCom to be used to break idempotency (that's basically what state per
> > > > > task is about), then XCom will stop serving its purpose.
> > > > >
> > > > > And of course, we are in the new "world" where we are not only
> > > > > supporting idempotent tasks. Various optimisations and different kinds
> > > > > of workloads require breaking the "old" idempotency rules we used to
> > > > > have when Airflow was used mainly for ETL. And deletion of XCom state
> > > > > was also questioned back then because people **wanted** to use XCom in
> > > > > other ways. But we held strongly and I think that was a good choice.
> > > > >
> > > > > Repurposing XCom to do "something" else might seem like a good idea,
> > > > > even for Apple, because they could internally agree on some convention
> > > > > and use it as a "solution". But when you look at Airflow as a product,
> > > > > repurposing XCom to also do something else (i.e. storing state) seems
> > > > > a bit "lazy" and "short-cut-y".
> > > > >
> > > > > What does it save if you do it this way? A few things:
> > > > >
> > > > > * not having to do a database migration to implement the new feature
> > > > > * avoiding having a clearly defined API where state can be stored for
> > > > > various purposes on different levels (Task Instance, Task, Task Group
> > > > > maybe, Dag, Team eventually)
> > > > > * avoiding thinking about and preparing for all the various use cases
> > > > > that people would really like to use it for
> > > > > * avoiding writing the use-case documentation explaining how you can
> > > > > use state
> > > > > * avoiding writing all the test cases making sure that all those use
> > > > > cases are served
> > > > > * not thinking too much about the performance and security implications
> > > > > of those ("XCom has it already sorted out, I am sure it's going to be
> > > > > fine")
> > > > >
> > > > > Yes, it can be done way faster this way, and I understand some
> > > > > commercial users could have chosen this way as a shortcut to handle a
> > > > > specific use case they had in mind. This is absolutely understandable,
> > > > > and this is what I would even expect a for-profit company to do to
> > > > > improve so-called "time-to-market" and start reaping the benefits of it
> > > > > faster.
> > > > >
> > > > > But should we do it in Airflow the same way? We are not a for-profit
> > > > > company; time-to-market of such a feature is secondary compared to
> > > > > stability, maintainability and having a "product" vision.
> > > > > I consider all the above points absolutely crucial properties of a
> > > > > "product", which Airflow is. They might not be needed in a "solution",
> > > > > but having a good "product" absolutely requires all those things.
> > > > > When we switched to Airflow 3, one of the ideas was to remove all the
> > > > > bad "solution-y" decisions we made in the past that slowed us down in
> > > > > general and, more importantly, turned us (as Daniel used to say) into
> > > > > "back-compatibility engineers".
> > > > >
> > > > > Does it mean it will take longer and require more dedication, effort
> > > > > and discussion to agree on the scope? Absolutely. Is this a bad thing?
> > > > > I don't think so.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > What is the motivation behind storing internal state in a task,
> > > > > > instead of splitting the logic on state boundaries into multiple
> > > > > > tasks? That’s what the task abstraction is supposed to be for, and
> > > > > > you wouldn’t need a separate mechanism for that; regular XCom would
> > > > > > just work.
> > > > > >
> > > > > > While storing state is a legitimate use case, I feel this particular
> > > > > > idea would have a negative impact by encouraging people to do too
> > > > > > many things in one task. I’d even argue the examples given in the
> > > > > > Confluence document already do so.
> > > > > >
> > > > > > TP
> > > > > >
> > > > > >
> > > > > > > On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi folks!
> > > > > > >
> > > > > > > We would like to propose a new feature in Airflow: a boolean
> > > > > > > parameter, "persist_xcom_through_retry", in all Airflow Operators.
> > > > > > > Our team added this feature in our internal fork a few years back,
> > > > > > > and it has been benefiting our users extensively.
> > > > > > >
> > > > > > > *I have created an AIP at
> > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333*.
> > > > > > > Below is a summary (in the complete AIP, we have a more detailed
> > > > > > > problem statement and quite a few interesting use-case examples):
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Traditionally, XCom is defined as “a mechanism that lets Tasks
> > > > > > > talk to each other”. However, XCom also has the capacity and
> > > > > > > potential to help persist and manage task state within a task
> > > > > > > itself. Currently, Apache Airflow automatically clears a task
> > > > > > > instance’s XCom data when it is retried. This behavior, while
> > > > > > > ensuring clean state for retry attempts, creates limitations:*
> > > > > > >
> > > > > > >   - *Loss of Internal Progress: Tasks that have internal
> > > > > > >   checkpointing or progress tracking lose all intermediate state on
> > > > > > >   retry, forcing a restart from the beginning.*
> > > > > > >   - *Resource State Loss: Tasks cannot maintain state about allocated
> > > > > > >   resources (compute instances, downstream job IDs, etc.) across
> > > > > > >   retry attempts, leading to redundant expensive setup operations.*
> > > > > > >   - *No Recovery/Resume Capability: There's no way for tasks to
> > > > > > >   resume from internal checkpoints when transient failures occur
> > > > > > >   during long-running atomic operations.*
> > > > > > >   - *Poor User Experience: Users must implement external state
> > > > > > >   management systems to work around this limitation, adding
> > > > > > >   complexity to DAG authoring.*
> > > > > > >
> > > > > > >
> > > > > > > *This proposal aims to extend the capacity of XCom by allowing a
> > > > > > > Task Instance’s XCom to persist through its retries, enabling users
> > > > > > > to build more resilient and efficient pipelines. This is
> > > > > > > particularly useful for tasks that are atomic (so one such task
> > > > > > > cannot be split into multiple tasks) and need to manage internal
> > > > > > > state or checkpoints.*
> > > > > > >
> > > > > > >
> > > > > > > We look forward to your feedback and thoughts. Thanks!
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > XD
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
