Re: AIP - Add "persist_xcom_through_retry" Parameter to Airflow Operators

Daniel Standish via dev Wed, 19 Nov 2025 14:38:36 -0800

Yeah XD I was just saying that the config feels a little bit like that.

But my main point was to try to suggest that we evaluate this based on the
interface design, and not whether it allows for non-idempotent tasks
because:


   - Airflow already allows such things
   - and the concept is somewhat fraught anyway
   - and I don't think Airflow should be so opinionated about it as to go
   out of its way to forbid features needed for real-world workflows


I just wanted to suggest we focus on one question: is this the right
interface, or a good interface, or at least a good enough interface?

Hopefully that did not get lost in my other comments.

Interface design is hard.





On Wed, Nov 19, 2025 at 1:38 PM Xiaodong Deng <[email protected]> wrote:

> Hi folks,
>
> Thanks a lot for all the valuable inputs.
>
> Regarding what @Daniel mentioned, I get what you shared. It's just similar
> to the concern regarding Configurations: new minor varieties keep stemming
> from the main ones (like the `Pool.include_deferred` example you
> mentioned). I fully get you. Does this also mean we should start to have a
> concrete, detailed guideline for how we evaluate new features? That's
> possibly worth considering.
>
> Regarding what @TP shared:
> - For use case "External Job Tracking and Polling", yes, the intuition
> would be an operator + a sensor. In the Confluence doc, we had a line to
> explain "why not separate job Triggering and Polling into two steps". May
> or may not be a solid reason.
> - For other points, you mentioned the better option may be to "making it a
> separate task". That applies most of the time, I agree, while there can be
> exceptions (that's why most of us are agreeing State can be a useful
> feature here, even if we may be proposing different approaches).
> - In the end, I feel there seems to be a recommended way/philosophy of
> using Airflow ("flexibility" vs. "recommended practice" vs. ...), but
> it's not clearly summarized anywhere. That's possibly another thing worth
> considering.
>
> Given all the valuable inputs from the folks, I will withdraw this
> proposal for now. I'm happy to discuss alternative approaches with
> everyone.
>
> Thanks again!
>
>
> Regards,
> XD
>
>
> On 2025/11/19 06:28:49 Jens Scheffler wrote:
> > Hi,
> >
> > I would add (6) as a use case, as I noted in a comment in Confluence and
> > as TP highlighted: add the try number and keep history, so that
> > differences between runs can be seen (e.g. by an admin as a sanity check
> > after the Dag code was patched: a downstream task might not have been
> > re-run and might depend on an older XCom). That would help with
> > troubleshooting.
> >
> > But (6) is NOT meant to enable logic based on try_number, as that would
> > be another purpose in my view.
> >
> > So in this case I think the discussion is valuable and some extension in
> > all the listed use cases makes sense to me!
> >
> > Jens
> >
> > On 11/19/25 07:07, Tzu-ping Chung via dev wrote:
> > > What I feel is, while it is fine to have more than one way to do a
> > > thing, some of the examples do not sufficiently discuss why existing
> > > features are not suitable for the use case. This context is important
> > > since it would affect how we implement the new feature to sufficiently
> > > distinguish it from existing ones, so it is easier to make the correct
> > > decision when you are choosing between features to achieve a goal. It
> > > is also a good chance for us to take a look at enhancing other
> > > existing features so they cover more use cases and work better
> > > together.
> > >
> > > I’ll try to break down each use case in the appendix. To be clear, I
> > > can think of some possibilities for each case why a new feature is
> > > preferred, but the problem is the document should sufficiently explore
> > > and discuss existing solutions.
> > >
> > > 1: Large Dataset Processing with Checkpoints
> > >
> > > It is unclear from the example how the use case cannot be satisfied
> > > by dynamic task mapping:
> > >
> > >      @task
> > >      def process_record(record): ...
> > >
> > >      @task(trigger_rule="always")
> > >      def summary(results): ...
> > >
> > >      results = process_record.expand(record=get_records_to_process())
> > >      summary(results)
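
To make the dynamic task mapping suggestion above a bit more concrete, here is a plain-Python sketch of the idea rather than real Airflow code. It assumes each record becomes its own unit of work with its own retries, so a transient failure re-runs only the affected record and no in-task checkpoint is needed. The names `run_mapped`, `flaky_process`, and `max_tries` are hypothetical and do not correspond to any Airflow API.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_mapped(
    fn: Callable[[int], int], records: List[int], max_tries: int = 3
) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Run fn once per record, retrying each record independently.

    Hypothetical stand-in for mapped task instances: every record has
    its own try counter, so one failure never restarts the others.
    """
    results: Dict[int, int] = {}
    attempts: Dict[int, int] = {}
    for index, record in enumerate(records):
        for try_number in range(1, max_tries + 1):
            attempts[index] = try_number
            try:
                results[index] = fn(record)
                break
            except RuntimeError:
                continue  # retry only this record
    return results, attempts


_failed_once: Set[int] = set()


def flaky_process(record: int) -> int:
    """Fail on the first attempt for record 2, then succeed."""
    if record == 2 and record not in _failed_once:
        _failed_once.add(record)
        raise RuntimeError("transient failure")
    return record * 10


def summary(results: Dict[int, int], total: int) -> Dict[str, int]:
    """Runs regardless of failures, like a task with an 'always' rule."""
    return {"processed": len(results), "failed": total - len(results)}


records = [1, 2, 3]
results, attempts = run_mapped(flaky_process, records)
report = summary(results, len(records))
print(results, attempts, report)
```

In this toy run only the second record is attempted twice; the other two are processed exactly once. Whether this fits a given workload depends on how expensive it is to enumerate the records up front, which the sketch does not model.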
> > >
> > > 2: External Job Tracking and Polling
> > >
> > > This looks like a use case for sensors to me.
> > >
> > > 3: More Efficient API Integration
> > >
> > > Why does make_api_calls need to be in the same task? All existing
> > > patterns in Airflow point to making it a separate task.
> > >
> > > 4: Resource Management and Cleanup
> > >
> > > Isn’t this what teardown tasks are for?
> > >
> > > 5: Adaptive Processing with Learning
> > >
> > > This is the use case that I feel the proposal is most useful for.
> > > However, it can also be satisfied by Variable, or the state
> > > persistence mechanism mentioned by Ash.
> > >
> > > In some ways, the three are really the same thing (a way to keep
> > > context), except they have different scopes. Variable has the global
> > > (to the Airflow instance) scope, XCom the task runner process scope
> > > (almost task instance scope, but not quite, since it’s cleared for
> > > retry). StateVariable is also global as currently proposed, but from
> > > the listed use cases, it is arguably more suitable to be task- or
> > > dag-scoped (not to be confused with being scoped to a task instance or
> > > dag run).
> > >
> > > Back to the proposal at hand, the way I understand
> > > persist_xcom_through_retry is it essentially switches all XComs pushed
> > > in the task from being scoped by the task instance *try* to the task
> > > instance (across all tries). I think the idea itself is worth having,
> > > and having a task-level flag may be a good way to expose it to users.
> > > However, I feel there are some choices we can still discuss on what
> > > the feature actually means beyond having a flag that does one specific
> > > thing internally.
> > >
> > > For example, perhaps we should remodel XComModel to include a
> > > try_number, and allow it to be scoped against either a ti or a ti try?
> > > Potentially even more choices such as task-scoped across runs, or
> > > globally by unifying Variable? There are many open questions from my
> > > point of view, and again, I feel the proposal document should discuss
> > > the use cases in more detail to pin down the specifics, instead of
> > > leaving things out for interpretation.
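
As a rough illustration of the scoping question raised above, the following toy model stores values either per try or per task instance. It is only a sketch under stated assumptions: `XComKey`, `ToyXComStore`, and the convention that `try_number=None` means "task instance scope" are all hypothetical, and nothing here reflects the actual XComModel schema or any Airflow API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class XComKey:
    """Hypothetical row key; try_number=None means task-instance scope."""

    dag_id: str
    run_id: str
    task_id: str
    key: str
    try_number: Optional[int] = None


class ToyXComStore:
    """In-memory stand-in for a table with an optional try_number column."""

    def __init__(self) -> None:
        self._rows: Dict[XComKey, Any] = {}

    def push(self, dag_id: str, run_id: str, task_id: str, key: str,
             value: Any, try_number: Optional[int] = None) -> None:
        self._rows[XComKey(dag_id, run_id, task_id, key, try_number)] = value

    def pull(self, dag_id: str, run_id: str, task_id: str, key: str,
             try_number: Optional[int] = None) -> Any:
        return self._rows.get(XComKey(dag_id, run_id, task_id, key, try_number))

    def history(self, dag_id: str, run_id: str, task_id: str,
                key: str) -> List[Tuple[int, Any]]:
        """All try-scoped values for one key, ordered by try number."""
        target = (dag_id, run_id, task_id, key)
        return sorted(
            (k.try_number, v)
            for k, v in self._rows.items()
            if (k.dag_id, k.run_id, k.task_id, k.key) == target
            and k.try_number is not None
        )


store = ToyXComStore()
ti = ("etl", "run_1", "load")

# Try-scoped values: each try keeps its own copy.
store.push(*ti, "rows_loaded", 100, try_number=1)
store.push(*ti, "rows_loaded", 250, try_number=2)

# Task-instance-scoped value: shared across all tries.
store.push(*ti, "checkpoint", "batch-7")

first_try = store.pull(*ti, "rows_loaded", try_number=1)
checkpoint = store.pull(*ti, "checkpoint")
history = store.history(*ti, "rows_loaded")
missing = store.pull(*ti, "rows_loaded", try_number=3)
print(first_try, checkpoint, history, missing)
```

A per-try history like the one returned by `history` would also cover the troubleshooting use case (6) mentioned earlier in the thread, without requiring task logic to branch on the try number.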
> > >
> > > TP
> > >
> > >
> > >> On 19 Nov 2025, at 06:20, Xiaodong Deng <[email protected]> wrote:
> > >>
> > >> Thanks for your valuable feedback, folks.
> > >>
> > >> Hi @TP,
> > >>
> > >> There are cases where breaking a task down into multiple tasks is not
> > >> feasible or not the best option; for example, use case 1, which I
> > >> shared in the Confluence doc appendix.
> > >>
> > >> There are also examples where splitting into multiple tasks may seem
> > >> to make sense but may cause side effects. For use cases 2 and 4 in the
> > >> Confluence doc appendix, I explained why we do the work in a single
> > >> task instead of splitting it into two tasks.
> > >>
> > >> Some tasks are simply atomic.
> > >>
> > >>
> > >> Hi @Jarek,
> > >>
> > >> I'm glad we are talking about idempotency. That's exactly why we
> > >> sometimes cannot break down some tasks. In the "Problem Examples"
> > >> section of the Confluence doc, I covered that to some extent.
> > >>
> > >> I would love to discuss this more, or learn from you about any
> > >> alternative solutions that can become available to Airflow users in a
> > >> timely manner.
> > >>
> > >> Many thanks!
> > >>
> > >>
> > >> Regards,
> > >> XD
> > >>
> > >> On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > >>> I agree with TP wholeheartedly. The basic reason why XCom is deleted
> > >>> when restarting is to maintain idempotency principles. And if we
> > >>> allow XCom to be used to break idempotency (that's basically what
> > >>> state per task is about) - then XCom will stop serving its purpose.
> > >>>
> > >>> And of course - we are in the new "world" where we are not only
> > >>> supporting idempotent tasks. Various optimisations and different
> > >>> kinds of workloads require breaking the "old" idempotency rules we
> > >>> used to have when Airflow was used mainly for ETL. And deletion of
> > >>> XCom state was also questioned back then because people **wanted** to
> > >>> use XCom in other ways. But we held firm and I think that was a good
> > >>> choice.
> > >>>
> > >>> And repurposing XCom to do "something" else might seem like a good
> > >>> idea - even for Apple, because they could internally agree to some
> > >>> convention and use it as a "solution". But when you look at Airflow
> > >>> as a product, repurposing XCom to also do something else (i.e.
> > >>> storing state) seems a bit "lazy" and "short-cut-y".
> > >>>
> > >>> What does it save if you do it this way? A few things:
> > >>>
> > >>> * not having to do a database migration to implement a new feature
> > >>> * avoiding having a clearly defined API where state can be stored for
> > >>> various purposes on different levels (Task Instance, Task, Task Group
> > >>> maybe, Dag, Team eventually)
> > >>> * avoiding thinking about and preparing for all the various use cases
> > >>> that people really would like to use it for
> > >>> * avoiding writing the use-case documentation explaining how you can
> > >>> use state
> > >>> * avoiding writing all the test cases making sure that all those use
> > >>> cases are served
> > >>> * not thinking too much about the performance and security
> > >>> implications of those ("XCom has it already sorted out, I am sure
> > >>> it's going to be fine")
> > >>>
> > >>> Yes, it can be done way faster this way, and I understand some
> > >>> commercial users could have chosen this way as a shortcut to handle a
> > >>> specific use case they had in mind. This is absolutely
> > >>> understandable, and this is what I would even expect a for-profit
> > >>> company to do to improve so-called "time-to-market" and start reaping
> > >>> the benefits of it faster.
> > >>>
> > >>> But should we do it in Airflow the same way? We are not a for-profit
> > >>> company; time-to-market of such a feature is secondary compared to
> > >>> stability, maintainability and having a "product" vision.
> > >>> I consider all the above points absolutely crucial properties of a
> > >>> "product" - which Airflow is. They might not be needed in a
> > >>> "solution", but having a good "product" absolutely requires all those
> > >>> things.
> > >>> When we switched to Airflow 3, one of the ideas was to remove all the
> > >>> bad "solution-y" decisions we made in the past that slowed us down in
> > >>> general and - more importantly - turned us (as Daniel used to say)
> > >>> into "back-compatibility engineers".
> > >>>
> > >>> Does it mean it will take longer and require more dedication, effort
> > >>> and discussions to agree on the scope? Absolutely. Is this a bad
> > >>> thing? I don't think so.
> > >>>
> > >>> J.
> > >>>
> > >>>
> > >>> On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> > >>> [email protected]> wrote:
> > >>>
> > >>>> What is the motivation behind storing internal state in a task,
> > >>>> instead of splitting the logic on state boundaries into multiple
> > >>>> tasks? That’s what the task abstraction is meant for, and you
> > >>>> wouldn’t need a separate mechanism for that; regular XCom would just
> > >>>> work.
> > >>>>
> > >>>> While storing state is a legitimate use case, I feel this particular
> > >>>> idea would have a more negative impact by encouraging people to do
> > >>>> too many things in one task. I’d even argue the examples given in
> > >>>> the Confluence document already do so.
> > >>>>
> > >>>> TP
> > >>>>
> > >>>>
> > >>>>> On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > >>>>>
> > >>>>> Hi folks!
> > >>>>>
> > >>>>> We would like to propose a new feature in Airflow: a boolean
> > >>>>> parameter, "persist_xcom_through_retry", in all Airflow Operators.
> > >>>>> Our team added this feature in our internal fork a few years back,
> > >>>>> and it has been benefiting our users extensively.
> > >>>>>
> > >>>>> *I have created an AIP at*
> > >>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333
> > >>>>> Below is a summary (in the complete AIP, we have a more detailed
> > >>>>> problem statement and quite a few interesting use-case examples):
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *Traditionally, XCom is defined as “a mechanism that lets Tasks talk
> > >>>>> to each other”. However, XCom also has the capacity and potential to
> > >>>>> help persist and manage task state within a task itself. Currently,
> > >>>>> Apache Airflow automatically clears a task instance’s XCom data when
> > >>>>> it is retried. This behavior, while ensuring clean state for retry
> > >>>>> attempts, creates limitations:*
> > >>>>>
> > >>>>>   - *Loss of Internal Progress: Tasks that have internal
> > >>>>>   checkpointing or progress tracking lose all intermediate state on
> > >>>>>   retry, forcing restart from the beginning.*
> > >>>>>   - *Resource State Loss: Tasks cannot maintain state about
> > >>>>>   allocated resources (compute instances, downstream job IDs, etc.)
> > >>>>>   across retry attempts, leading to redundant expensive setup
> > >>>>>   operations.*
> > >>>>>   - *No Recovery/Resume Capability: There's no way for tasks to
> > >>>>>   resume from internal checkpoints when transient failures occur
> > >>>>>   during long-running atomic operations.*
> > >>>>>   - *Poor User Experience: Users must implement external state
> > >>>>>   management systems to work around this limitation, adding
> > >>>>>   complexity to DAG authoring.*
> > >>>>>
> > >>>>> *This proposal aims to extend the capacity of XCom by allowing a
> > >>>>> Task Instance’s XCom to persist through its retries, enabling users
> > >>>>> to build more resilient and efficient pipelines. This is
> > >>>>> particularly useful for tasks that are atomic (so one such task
> > >>>>> cannot be split into multiple tasks) and need to manage internal
> > >>>>> state or checkpoints.*
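
The difference between clearing and persisting state across retries, as described in the summary above, can be sketched in plain Python. This is a toy simulation and not how Airflow implements retries: `run_with_retries`, `make_checkpointing_task`, and the `persist_state` argument (standing in for the proposed `persist_xcom_through_retry` flag) are hypothetical names, and a dict stands in for the task's XCom data.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]


def run_with_retries(
    task_fn: Callable[[State], int], max_tries: int, persist_state: bool
) -> Tuple[int, int]:
    """Run task_fn until it succeeds; return (result, tries used).

    With persist_state=False the state dict is reset before every try,
    mimicking XCom being cleared when a task instance is retried.
    """
    state: State = {}
    for try_number in range(1, max_tries + 1):
        if not persist_state:
            state = {}
        try:
            return task_fn(state), try_number
        except RuntimeError:
            continue
    raise RuntimeError("all tries failed")


def make_checkpointing_task(items: List[str], fail_at: int) -> Callable[[State], int]:
    """Build a task that records progress and fails once at index fail_at."""
    counters = {"items_processed": 0, "already_failed": 0}

    def task(state: State) -> int:
        # Resume from the checkpoint if one survived the retry.
        start = state.get("next_index", 0)
        for index in range(start, len(items)):
            if index == fail_at and not counters["already_failed"]:
                counters["already_failed"] = 1
                raise RuntimeError("transient failure")
            counters["items_processed"] += 1
            state["next_index"] = index + 1
        # Total units of work performed across all tries.
        return counters["items_processed"]

    return task


items = ["a", "b", "c", "d", "e"]

work_cleared, tries_cleared = run_with_retries(
    make_checkpointing_task(items, fail_at=3), max_tries=3, persist_state=False
)
work_persisted, tries_persisted = run_with_retries(
    make_checkpointing_task(items, fail_at=3), max_tries=3, persist_state=True
)
print(work_cleared, tries_cleared, work_persisted, tries_persisted)
```

Both runs finish on the second try, but the cleared run repeats the first three items while the persisted run resumes at the checkpoint. The sketch says nothing about idempotency, which is the concern raised in the replies.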
> > >>>>>
> > >>>>>
> > >>>>> We look forward to your feedback and thoughts. Thanks!
> > >>>>>
> > >>>>>
> > >>>>> Regards,
> > >>>>>
> > >>>>> XD
> > >>>>
> > >>>>
> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: [email protected]
> > >>>> For additional commands, e-mail: [email protected]
> > >>>>
> > >>>>
