Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Xiaodong Deng Thu, 20 Nov 2025 11:00:10 -0800

Hi Daniel!

I totally get you.


It's somehow like we are decorating a big house, which is already full of fancy 
things but somehow a little disorganized. This AIP here is like suggesting 
putting one more bouquet of flowers on the dining table: yes, it may smell & 
look nice, and seems not wrong at all given we have been putting flowers here & 
there all the time. That's why I proceeded to share this AIP here.

But I totally understand the concern that we may want to unify the 
interface/the style/the recommendation/...: a unified decoration style of this 
house.

I still think the feature this AIP suggests is necessary. I look forward 
to the follow-up discussions on making it happen via a better approach and 
the right interface.

Thanks!


XD

On 2025/11/19 22:38:08 Daniel Standish via dev wrote:
> Yeah XD I was just saying that the config feels a little bit like that.
> 
> But my main point was to try to suggest that we evaluate this based on the
> interface design, and not whether it allows for non-idempotent tasks
> because:
> 
>    - Airflow already allows such things
>    - and the concept is somewhat fraught anyway
>    - and I don't think Airflow should be so opinionated about it to go out
>    of its way to forbid features needed for real world workflow
> 
> 
> Just wanted to suggest we focus on, is this the right interface, or a good
> interface, or a good enough interface, etc.
> 
> Hopefully that did not get lost in my other comments.
> 
> Interface design is hard.
> 
> 
> 
> 
> 
> On Wed, Nov 19, 2025 at 1:38 PM Xiaodong Deng <[email protected]> wrote:
> 
> > Hi folks,
> >
> > Thanks a lot for all the valuable inputs.
> >
> > Regarding what @Daniel mentioned, I get what you shared. It's just similar
> > to the concern regarding Configurations: new minor varieties keep stemming
> > from the main ones (like the `Pool.include_deferred` example you
> > mentioned). I fully get you. Does this also mean we should start to have a
> > concrete, detailed guideline on how we should consider new features? That's
> > possibly worth considering.
> >
> > Regarding what @TP shared:
> > - For use case "External Job Tracking and Polling", yes, the intuition
> > would be an operator + a sensor. In the Confluence doc, we had a line to
> > explain "why not separate job Triggering and Polling into two steps". May
> > or may not be a solid reason.
> > - For other points, you mentioned the better option may be to "making it a
> > separate task". That applies most of the time, I agree, while there can be
> > exceptions (that's why most of us are agreeing State can be a useful
> > feature here, even if we may be proposing different approaches).
> > - In the end I feel there seems to be a recommended way/philosophy of
> > using Airflow ("flexibility" vs. "recommended practice" vs. ...), but
> > it's not clearly summarized anywhere. That's possibly another thing worth
> > considering.
> >
> > Given all the valuable inputs from the folks, I will withdraw this
> > proposal for now. I'm happy to discuss alternative approaches with
> > folks.
> >
> > Thanks again!
> >
> >
> > Regards,
> > XD
> >
> >
> > On 2025/11/19 06:28:49 Jens Scheffler wrote:
> > > Hi,
> > >
> > > I would add (6) as a use case, as I noted in a comment in the Confluence
> > > doc and TP highlighted: add the try number and keep history to see
> > > differences between runs (as an admin, for sanity checks/history after dag
> > > code was e.g. patched: a downstream task might not have been re-run and
> > > might depend on an older XCom ...), so that would help with
> > > troubleshooting.
> > >
> > > But (6) is NOT meant for having logic based on try_number, as that would
> > > be another purpose in my view.
> > >
> > > So in this case I think the discussion is valuable and some extension in
> > > all the listed use cases makes sense to me!
> > >
> > > Jens
> > >
> > > On 11/19/25 07:07, Tzu-ping Chung via dev wrote:
> > > > What I feel is, while it is fine to have more than one way to do a
> > > > thing, some of the examples do not sufficiently discuss why existing
> > > > features are not suitable for the use case. This context is important
> > > > since it would affect how we implement the new feature to sufficiently
> > > > distinguish it from existing ones, so it is easier to make the correct
> > > > decision when you are choosing between features to achieve a goal. It
> > > > is also a good chance for us to take a look at enhancing other existing
> > > > features so they cover more use cases and work better together.
> > > >
> > > > I’ll try to break down each use case in the appendix. To be clear, I
> > > > can think of some possibilities for each case why a new feature is
> > > > preferred, but the problem is the document should sufficiently explore
> > > > and discuss existing solutions.
> > > >
> > > > 1: Large Dataset Processing with Checkpoints
> > > >
> > > > It is unclear from the example how the use case cannot be satisfied by
> > > > dynamic task mapping:
> > > >
> > > >      @task
> > > >      def process_record(record): ...
> > > >
> > > >      @task(trigger_rule="always")
> > > >      def summary(results): ...
> > > >
> > > >      results = process_record.expand(record=get_records_to_process())
> > > >      summary(results)
> > > >
> > > > 2: External Job Tracking and Polling
> > > >
> > > > This looks like a use case for sensors to me.
> > > >
> > > > 3: More Efficient API Integration
> > > >
> > > > Why does make_api_calls need to be in the same task? All existing
> > > > patterns in Airflow point to making it a separate task.
> > > >
> > > > 4: Resource Management and Cleanup
> > > >
> > > > Isn’t this what teardown tasks are for?
> > > >
> > > > 5: Adaptive Processing with Learning
> > > >
> > > > This is the use case that I feel the proposal is most useful for.
> > > > However, it can also be satisfied by Variable, or the state persistence
> > > > mechanism mentioned by Ash.
> > > >
> > > > In some ways, the three are really the same thing (a way to keep
> > > > context), except they have different scopes. Variable has the global (to
> > > > the Airflow instance) scope, XCom the task runner process scope (almost
> > > > task instance scope, but not quite, since it’s cleared for retry).
> > > > StateVariable is also global as currently proposed, but from the listed
> > > > use cases, it is arguably more suitable to be task- or dag-scoped (not to
> > > > be confused with being scoped to a task instance or dag run).
> > > >
> > > > Back to the proposal at hand, the way I understand
> > > > persist_xcom_through_retry is it essentially switches all XComs pushed
> > > > in the task from being scoped by the task instance *try* to the task
> > > > instance (across all tries). I think the idea itself is worth having,
> > > > and having a task-level flag may be a good way to expose it to users.
> > > > However, I feel there are some choices we can still discuss on what the
> > > > feature actually means beyond having a flag that does one specific
> > > > thing internally.
> > > >
> > > > For example, perhaps we should remodel XComModel to include a
> > > > try_number, and allow it to be scoped both against a ti or a ti try?
> > > > Potentially even more choices such as task-scoped across runs, or
> > > > globally by unifying Variable? There are many open questions from my
> > > > point of view, and again, I feel the proposal document should discuss
> > > > the use cases in more detail to pin down the specifics, instead of
> > > > leaving things out for interpretation.
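
To make the scoping question above concrete, here is a minimal,
self-contained Python sketch. It is not Airflow's XComModel or any existing
Airflow API: the class ToyXComStore, its methods, and the scope names "try"
and "ti" are all hypothetical, chosen only to illustrate how storing a
try_number next to each value could allow reads scoped either to a single
try or to the task instance across tries.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical row key: (task instance id, XCom key, try_number).
RowKey = Tuple[str, str, int]


@dataclass
class ToyXComStore:
    """In-memory stand-in for an XCom table that also records try_number."""

    _rows: Dict[RowKey, Any] = field(default_factory=dict)

    def push(self, ti_id: str, key: str, value: Any, try_number: int) -> None:
        # Each try writes its own row, so earlier tries are never overwritten.
        self._rows[(ti_id, key, try_number)] = value

    def pull(self, ti_id: str, key: str, try_number: int,
             scope: str = "try") -> Optional[Any]:
        if scope == "try":
            # Try-scoped read: only values pushed by this exact try.
            return self._rows.get((ti_id, key, try_number))
        if scope == "ti":
            # Task-instance-scoped read: latest value from this or any
            # earlier try of the same task instance.
            tries = [t for (i, k, t) in self._rows
                     if i == ti_id and k == key and t <= try_number]
            if not tries:
                return None
            return self._rows[(ti_id, key, max(tries))]
        raise ValueError(f"unknown scope: {scope!r}")

    def history(self, ti_id: str, key: str) -> List[Tuple[int, Any]]:
        # Per-try history, e.g. for troubleshooting after a retry.
        return sorted((t, v) for (i, k, t), v in self._rows.items()
                      if i == ti_id and k == key)
```

With such a layout, a value pushed during try 1 is invisible to a
try-scoped pull in try 2 but visible to a ti-scoped pull, and history()
keeps one entry per try without any task logic depending on try_number.
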
> > > >
> > > > TP
> > > >
> > > >
> > > >> On 19 Nov 2025, at 06:20, Xiaodong Deng <[email protected]> wrote:
> > > >>
> > > >> Thanks for your valuable feedback, folks.
> > > >>
> > > >> Hi @TP,
> > > >>
> > > >> There are cases where breaking down to multiple tasks is not feasible
> > > >> or not the best option. For example, use case 1, which I shared in the
> > > >> Confluence doc appendix.
> > > >>
> > > >> There are also examples where splitting into multiple tasks may seem to
> > > >> make sense but may cause side effects. For use cases 2 and 4 in the
> > > >> Confluence doc appendix, I shared why we do it in a single task instead
> > > >> of splitting them into two tasks.
> > > >>
> > > >> Some tasks are simply atomic.
> > > >>
> > > >>
> > > >> Hi @Jarek,
> > > >>
> > > >> I'm glad we are talking about idempotency. That's exactly why
> > > >> sometimes we cannot break down some tasks. In the "Problem Examples"
> > > >> section in the Confluence doc, I covered that to some extent.
> > > >>
> > > >> Would love to discuss more on this, or learn from you about any
> > > >> alternative solutions which can become available to Airflow users in a
> > > >> timely manner.
> > > >>
> > > >> Many thanks!
> > > >>
> > > >>
> > > >> Regards,
> > > >> XD
> > > >>
> > > >> On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > > >>> I agree with TP wholeheartedly. The basic reason why XCom is deleted
> > > >>> when restarting is to maintain idempotency principles. And if we allow
> > > >>> XCom to be used to break idempotency (that's basically what state per
> > > >>> task is about) - then XCom will stop serving its purpose.
> > > >>>
> > > >>> And of course - we are in the new "world" where we are not only
> > > >>> supporting idempotent tasks. Various optimisations and different kinds
> > > >>> of workloads require breaking the "old" idempotency rules we used to
> > > >>> have when Airflow was used mainly for ETL. And deletion of XCom state
> > > >>> was also questioned back then because people **wanted** to use XCom in
> > > >>> other ways. But we held strongly and I think that was a good choice.
> > > >>>
> > > >>> And repurposing XCom to do "something" else might seem like a good
> > > >>> idea - even for Apple, because they could internally agree to some
> > > >>> convention and use it as a "solution". But when you look at Airflow as
> > > >>> a product, repurposing XCom to also do something else (i.e. storing
> > > >>> state) seems a bit "lazy" and "short-cut-y".
> > > >>>
> > > >>> What does it save if you do it this way? A few things:
> > > >>>
> > > >>> * not having to do a database migration to implement the new feature
> > > >>> * avoiding having a clearly defined API where state can be stored for
> > > >>> various purposes on different levels (Task Instance, Task, Task Group
> > > >>> maybe, Dag, Team eventually)
> > > >>> * avoiding thinking about and preparing for all the various use cases
> > > >>> that people would really like to use it for
> > > >>> * avoiding writing the use-case documentation explaining how you can
> > > >>> use state
> > > >>> * avoiding writing all the test cases making sure that all those use
> > > >>> cases are served well
> > > >>> * not thinking too much about performance and security implications of
> > > >>> those ("XCom has it already sorted out, I am sure it's going to be
> > > >>> fine")
> > > >>>
> > > >>> Yes, it can be done way faster this way, and I understand some
> > > >>> commercial users could have chosen this way as a shortcut to handle a
> > > >>> specific use case they had in mind. This is absolutely understandable,
> > > >>> and this is what I would even expect a for-profit company to do to
> > > >>> shorten so-called "time-to-market" and start reaping the benefits of it
> > > >>> faster.
> > > >>>
> > > >>> But should we do it in Airflow the same way? We are not a for-profit
> > > >>> company; time-to-market of such a feature is secondary compared to
> > > >>> stability, maintainability and having a "product" vision.
> > > >>> I consider all the above points as absolutely crucial properties of a
> > > >>> "product" - which Airflow is. They might not be needed in a "solution",
> > > >>> but having a good "product" absolutely requires all those things.
> > > >>>
> > > >>> When we switched to Airflow 3, one of the ideas was to remove all the
> > > >>> bad "solution-y" decisions we made in the past that slowed us down in
> > > >>> general and - more importantly - turned us (as Daniel used to say) into
> > > >>> "back-compatibility engineers".
> > > >>>
> > > >>> Does it mean it will take longer and require more dedication and
> > > >>> effort and discussions to agree on the scope? Absolutely. Is this a bad
> > > >>> thing? I don't think so.
> > > >>>
> > > >>> J.
> > > >>>
> > > >>>
> > > >>> On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> > > >>> [email protected]> wrote:
> > > >>>
> > > >>>> What is the motivation behind storing internal state in a task,
> > > >>>> instead of splitting the logic on state boundaries into multiple
> > > >>>> tasks? That’s what the task abstraction is meant for, and you wouldn’t
> > > >>>> need a separate mechanism for that; regular XCom would just work.
> > > >>>>
> > > >>>> While storing state is a legitimate use case, I feel this particular
> > > >>>> idea would have a net negative impact by encouraging people to do too
> > > >>>> many things in one task. I’d even argue the examples given in the
> > > >>>> Confluence document already do so.
> > > >>>>
> > > >>>> TP
> > > >>>>
> > > >>>>
> > > >>>>> On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > > >>>>>
> > > >>>>> Hi folks!
> > > >>>>>
> > > >>>>> We would like to propose a new feature in Airflow: a boolean
> > > >>>>> parameter, "persist_xcom_through_retry", in all Airflow Operators.
> > > >>>>> Our team added this feature in our internal fork a few years back, and
> > > >>>>> it has been benefiting our users extensively.
> > > >>>>>
> > > >>>>> I have created an AIP at
> > > >>>>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>.
> > > >>>>> Below is a summary (in the complete AIP, we have a more detailed
> > > >>>>> problem statement and quite a few interesting use-case examples):
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> *Traditionally, XCom is defined as “a mechanism that lets Tasks talk
> > > >>>>> to each other”. However, XCom also has the capacity and potential to
> > > >>>>> help persist and manage task state within a task itself.*
> > > >>>>>
> > > >>>>> *Currently, Apache Airflow automatically clears a task instance’s XCom
> > > >>>>> data when it is retried. This behavior, while ensuring clean state for
> > > >>>>> retry attempts, creates limitations:*
> > > >>>>>
> > > >>>>>   - *Loss of Internal Progress: Tasks that have internal checkpointing
> > > >>>>>   or progress tracking lose all intermediate state on retry, forcing
> > > >>>>>   restart from the beginning.*
> > > >>>>>   - *Resource State Loss: Tasks cannot maintain state about allocated
> > > >>>>>   resources (compute instances, downstream job IDs, etc.) across retry
> > > >>>>>   attempts, leading to redundant expensive setup operations.*
> > > >>>>>   - *No Recovery/Resume Capability: There's no way for tasks to resume
> > > >>>>>   from internal checkpoints when transient failures occur during
> > > >>>>>   long-running atomic operations.*
> > > >>>>>   - *Poor User Experience: users must implement external state
> > > >>>>>   management systems to work around this limitation, adding complexity
> > > >>>>>   to DAG authoring.*
> > > >>>>>
> > > >>>>> *This proposal aims to extend the capability of XCom by allowing a
> > > >>>>> Task Instance’s XCom to persist through its retries, enabling users to
> > > >>>>> build more resilient and efficient pipelines. This is particularly
> > > >>>>> useful for tasks that are atomic (so one such task cannot be split
> > > >>>>> into multiple tasks) and need to manage internal state or
> > > >>>>> checkpoints.*
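
The checkpoint-loss limitation and the proposed behavior can be illustrated
with a toy retry loop. This is a plain-Python simulation, not Airflow code:
run_with_retries, TransientError, make_task and the persist_state flag are
hypothetical stand-ins for the retry handling and for what a
persist_xcom_through_retry-style flag might do.

```python
from typing import Callable, Dict, List


class TransientError(Exception):
    """Stand-in for a transient failure that triggers a retry."""


def run_with_retries(task: Callable[[Dict[str, int], int], int],
                     retries: int, persist_state: bool) -> int:
    """Run task up to retries + 1 times, optionally keeping its state."""
    state: Dict[str, int] = {}
    for try_number in range(1, retries + 2):
        if not persist_state:
            # Mimics today's behavior: state is cleared before each try.
            state = {}
        try:
            return task(state, try_number)
        except TransientError:
            continue
    raise RuntimeError("task failed on every try")


def make_task(records: List[str], fail_at_index: int,
              processed: List[str]) -> Callable[[Dict[str, int], int], int]:
    """Build a task that fails once, on try 1, at the given record index."""

    def task(state: Dict[str, int], try_number: int) -> int:
        # Resume from the checkpoint if one survived the retry.
        start = state.get("next_index", 0)
        for index in range(start, len(records)):
            if try_number == 1 and index == fail_at_index:
                raise TransientError("simulated outage")
            processed.append(records[index])
            state["next_index"] = index + 1  # checkpoint after each record
        return len(records) - start  # records handled in this try

    return task
```

With five records and a failure at the fourth one, persist_state=True lets
the second try resume at the fourth record, while persist_state=False makes
it reprocess the first three records before finishing.
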
> > > >>>>>
> > > >>>>>
> > > >>>>> We look forward to your feedback and thoughts. Thanks!
> > > >>>>>
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>>
> > > >>>>> XD
> > > >>>>
> > > >>>>
> > ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: [email protected]
> > > >>>> For additional commands, e-mail: [email protected]
> > > >>>>
> > > >>>>
