Hi folks,

Thanks a lot for all the valuable input.

Regarding what @Daniel mentioned: I see your point. It's similar to the 
concern about Configurations, where new minor varieties keep branching off 
the main ones (like the `Pool.include_deferred` example you mentioned). Does 
this mean we should put together a concrete, detailed guideline for how we 
evaluate new features? That's possibly worth considering.

Regarding what @TP shared:
- For the "External Job Tracking and Polling" use case: yes, the intuitive 
design would be an operator plus a sensor. In the Confluence doc, we included 
a line explaining why we did not separate job triggering and polling into two 
steps; that reasoning may or may not hold up.
- For the other points, you suggested the better option may be to make it a 
separate task. I agree that applies most of the time, but there can be 
exceptions (which is why most of us agree that state can be a useful feature 
here, even if we propose different approaches).
- In the end, I feel there is an implicit recommended way/philosophy of using 
Airflow ("flexibility" vs. "recommended practice" vs. ...), but it is not 
clearly summarized anywhere. That's possibly another thing worth writing down.

Given all the valuable input, I will withdraw this proposal for now. I'm 
happy to discuss alternative approaches further with you all.

Thanks again!


Regards,
XD


On 2025/11/19 06:28:49 Jens Scheffler wrote:
> Hi,
> 
> I would add (6) as a use case, as I noted in a comment on the Confluence 
> page and TP highlighted: add the try number and keep history for seeing 
> differences between runs (as an admin, for a sanity check/history after the 
> dag code was e.g. patched - maybe a downstream task was not re-run and was 
> depending on an older XCom ... so that would help in case of 
> troubleshooting).
> 
> But (6) is NOT about having logic based on try_number, as that would serve 
> another purpose in my view.
> 
> So in this case I think the discussion is valuable, and some extension 
> covering all the listed use cases makes sense to me!
> 
> Jens
> 
> On 11/19/25 07:07, Tzu-ping Chung via dev wrote:
> > What I feel is, while it is fine to have more than one way to do a thing, 
> > some of the examples do not sufficiently discuss why existing features are 
> > not suitable for the use case. This context is important since it would 
> > affect how we implement the new feature to sufficiently distinguish it from 
> > existing ones, so it is easier to make the correct decision when you are 
> > choosing between features to achieve a goal. It is also a good chance for 
> > us to take a look at enhancing other existing features so they cover more 
> > use cases and work better together.
> >
> > I’ll try to break down each use case in the appendix. To be clear, I can 
> > think of possible reasons why a new feature would be preferred in each 
> > case, but the problem is that the document should sufficiently explore and 
> > discuss existing solutions.
> >
> > 1: Large Dataset Processing with Checkpoints
> >
> > It is unclear from the example why the use case cannot be satisfied by 
> > dynamic task mapping:
> >
> >      from airflow.decorators import task
> >
> >      @task
> >      def get_records_to_process(): ...
> >
> >      @task
> >      def process_record(record): ...
> >
> >      @task(trigger_rule="always")
> >      def summary(results): ...
> >
> >      results = process_record.expand(record=get_records_to_process())
> >      summary(results)
> >
> > 2: External Job Tracking and Polling
> >
> > This looks like a use case for sensors to me.
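> >
> > For illustration, a rough sketch of the operator + sensor split (the 
> > external_client and its methods are hypothetical placeholders, not from 
> > the doc):
> >
> >      from airflow.decorators import task
> >      from airflow.sensors.base import PokeReturnValue
> >
> >      @task
> >      def submit_job() -> str:
> >          # external_client is a hypothetical API client; returns a job ID.
> >          return external_client.submit()
> >
> >      @task.sensor(poke_interval=60, timeout=3600, mode="reschedule")
> >      def wait_for_job(job_id: str) -> PokeReturnValue:
> >          # Re-checks the remote status on each poke; the job ID survives
> >          # worker restarts and retries because it is a regular XCom pushed
> >          # by the upstream task.
> >          status = external_client.get_status(job_id)
> >          return PokeReturnValue(is_done=(status == "SUCCEEDED"))
> >
> >      wait_for_job(submit_job())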
> >
> > 3: More Efficient API Integration
> >
> > Why does make_api_calls need to be in the same task? All existing patterns 
> > in Airflow point to making it a separate task.
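> >
> > I.e. (a sketch; make_api_calls comes from the doc's example, the split and 
> > prepare_payloads are mine):
> >
> >      from airflow.decorators import task
> >
> >      @task
> >      def prepare_payloads(): ...
> >
> >      @task
> >      def make_api_calls(payload): ...
> >
> >      make_api_calls.expand(payload=prepare_payloads())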
> >
> > 4: Resource Management and Cleanup
> >
> > Isn’t this what teardown tasks are for?
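> >
> > E.g. a minimal setup/teardown sketch (the task names are made up; 
> > as_teardown is the Airflow 2.7+ API):
> >
> >      from airflow.decorators import task
> >
> >      @task
> >      def create_cluster() -> str: ...
> >
> >      @task
> >      def run_workload(cluster_id: str): ...
> >
> >      @task
> >      def delete_cluster(cluster_id: str): ...
> >
> >      cluster_id = create_cluster()
> >      # The teardown runs even if run_workload fails, so the resource is
> >      # cleaned up without keeping allocation state inside one big task.
> >      run_workload(cluster_id) >> delete_cluster(cluster_id).as_teardown(setups=cluster_id)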
> >
> > 5: Adaptive Processing with Learning
> >
> > This is the use case that I feel the proposal is most useful for. However, 
> > it can also be satisfied by Variable, or the state persistence mechanism 
> > mentioned by Ash.
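> >
> > E.g. a rough sketch with Variable (the key naming scheme here is just 
> > illustrative):
> >
> >      from airflow.decorators import task
> >      from airflow.models import Variable
> >
> >      @task
> >      def adaptive_process(**context):
> >          # Variable is global to the Airflow instance, so the key has to
> >          # encode the dag/task scope manually.
> >          key = f"batch_size__{context['dag'].dag_id}__{context['task'].task_id}"
> >          batch_size = int(Variable.get(key, default_var=100))
> >          # ... process with batch_size, then persist what was learned ...
> >          Variable.set(key, batch_size * 2)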
> >
> > In some ways, the three are really the same thing - a way to keep 
> > context - except they have different scopes. Variable has the global (to 
> > the Airflow instance) scope, while XCom has the task runner process scope 
> > (almost task instance scope, but not quite, since it's cleared on retry). 
> > StateVariable is also global as currently proposed, but from the listed 
> > use cases it is arguably more suitable to be task- or dag-scoped (not to 
> > be confused with being scoped to a task instance or dag run).
> >
> > Back to the proposal at hand: the way I understand 
> > persist_xcom_through_retry, it essentially switches all XComs pushed in 
> > the task from being scoped to the task instance *try* to the task instance 
> > (across all tries). I think the idea itself is worth having, and a 
> > task-level flag may be a good way to expose it to users. However, I feel 
> > there are still some choices to discuss about what the feature actually 
> > means, beyond having a flag that does one specific thing internally.
> >
> > For example, perhaps we should remodel XComModel to include a try_number, 
> > and allow an XCom to be scoped to either a TI or a TI try? Potentially 
> > even more choices, such as task-scoped across runs, or global by unifying 
> > with Variable? There are many open questions from my point of view, and 
> > again, I feel the proposal document should discuss the use cases in more 
> > detail to pin down the specifics, instead of leaving things open to 
> > interpretation.
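> >
> > Purely as a strawman, the scoping choice could even be explicit at the 
> > call site (none of these parameters exist today):
> >
> >      # Hypothetical: XComModel gains a try_number column, and pulls can
> >      # choose the scope explicitly.
> >      ti.xcom_pull(key="checkpoint", scope="ti_try")   # today's behaviour
> >      ti.xcom_pull(key="checkpoint", scope="ti")       # across tries
> >      ti.xcom_pull(key="checkpoint", scope="task")     # across dag runs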
> >
> > TP
> >
> >
> >> On 19 Nov 2025, at 06:20, Xiaodong Deng <[email protected]> wrote:
> >>
> >> Thanks for your valuable feedback, folks.
> >>
> >> Hi @TP,
> >>
> >> There are cases where breaking a task down into multiple tasks is not 
> >> feasible or not the best option - for example, use case 1 that I shared 
> >> in the Confluence doc appendix.
> >>
> >> There are also examples where splitting into multiple tasks may seem to 
> >> make sense but can cause downsides. For use cases 2 and 4 in the 
> >> Confluence doc appendix, I shared why we do it in a single task instead 
> >> of splitting it into two tasks.
> >>
> >> Some tasks are simply atomic.
> >>
> >>
> >> Hi @Jarek,
> >>
> >> I'm glad we are talking about idempotency. That's exactly why sometimes we 
> >> cannot break down some tasks. In the "Problem Examples" section of the 
> >> Confluence doc, I covered that to some extent.
> >>
> >> Would love to discuss this more, or learn about any alternative solutions 
> >> that could become available to Airflow users in a timely manner.
> >>
> >> Many thanks!
> >>
> >>
> >> Regards,
> >> XD
> >>
> >> On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> >>> I agree with TP wholeheartedly. The basic reason why XCom is deleted when
> >>> restarting is to maintain idempotency principles. And if we allow XCom to
> >>> be used to break idempotency (that's basically what state per task is
> >>> about) - then XCom will stop serving its purpose.
> >>>
> >>> And of course, we are in a new "world" where we are not only supporting
> >>> idempotent tasks. Various optimisations and different kinds of workloads
> >>> require breaking the "old" idempotency rules we used to have when Airflow
> >>> was used mainly for ETL. The deletion of XCom state was also questioned
> >>> back then, because people **wanted** to use XCom in other ways. But we
> >>> held firm, and I think that was a good choice.
> >>>
> >>> Repurposing XCom to do "something else" might seem like a good idea -
> >>> even for Apple, because they could internally agree on some convention
> >>> and use it as a "solution". But when you look at Airflow as a product,
> >>> repurposing XCom to also do something else (i.e. storing state) seems a
> >>> bit "lazy" and "short-cut-y".
> >>>
> >>> What does it save if you do it this way? A few things:
> >>>
> >>> * not having to do a database migration to implement the new feature
> >>> * avoiding having a clearly defined API where state can be stored for
> >>> various purposes at different levels (Task Instance, Task, maybe Task
> >>> Group, Dag, eventually Team)
> >>> * avoiding thinking about and preparing for all the various use cases
> >>> people would really like to use it for
> >>> * avoiding writing the use-case documentation explaining how you can use
> >>> state
> >>> * avoiding writing all the test cases making sure that all those use
> >>> cases are served well
> >>> * not thinking too much about the performance and security implications
> >>> of those ("XCom has it already sorted out, I am sure it's going to be
> >>> fine")
> >>>
> >>> Yes, it can be done way faster this way. And I understand some commercial
> >>> users could have chosen this shortcut to handle a specific use case they
> >>> had in mind. This is absolutely understandable, and it is what I would
> >>> even expect a for-profit company to do to shorten so-called
> >>> "time-to-market" and start reaping the benefits faster.
> >>>
> >>> But should we do it the same way in Airflow? We are not a for-profit
> >>> company; the time-to-market of such a feature is secondary compared to
> >>> stability, maintainability, and having a "product" vision.
> >>> I consider all the above points absolutely crucial properties of a
> >>> "product" - which Airflow is. They might not be needed in a "solution",
> >>> but having a good "product" absolutely requires all of them.
> >>>
> >>> When we switched to Airflow 3, one of the ideas was to remove all the bad
> >>> "solution-y" decisions we made in the past that slowed us down in general
> >>> and - more importantly - turned us (as Daniel used to say) into
> >>> "back-compatibility engineers".
> >>>
> >>> Does it mean it will take longer and require more dedication, effort, and
> >>> discussion to agree on the scope? Absolutely. Is this a bad thing? I
> >>> don't think so.
> >>>
> >>> J.
> >>>
> >>>
> >>> On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
> >>> [email protected]> wrote:
> >>>
> >>>> What is the motivation behind storing internal state in a task, instead
> >>>> of splitting the logic at state boundaries into multiple tasks? That's
> >>>> what the task abstraction is for, and you wouldn't need a separate
> >>>> mechanism for that - regular XCom would just work.
> >>>>
> >>>> While storing state is a legitimate use case, I feel this particular idea
> >>>> would have a net negative impact by encouraging people to do too many
> >>>> things in one task. I'd even argue the examples given in the Confluence
> >>>> document already do exactly that.
> >>>>
> >>>> TP
> >>>>
> >>>>
> >>>>> On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> >>>>>
> >>>>> Hi folks!
> >>>>>
> >>>>> We would like to propose a new feature in Airflow: a boolean parameter,
> >>>>> "persist_xcom_through_retry", in all Airflow Operators. Our team added
> >>>>> this feature to our internal fork a few years back, and it has been
> >>>>> benefiting our users extensively.
> >>>>>
> >>>>> I have created an AIP at
> >>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333.
> >>>>> Below is a summary (the complete AIP has a more detailed problem
> >>>>> statement and quite a few interesting use-case examples):
> >>>>>
> >>>>> Traditionally, XCom is defined as "a mechanism that lets Tasks talk to
> >>>>> each other". However, XCom also has the capacity and potential to help
> >>>>> persist and manage task state within a task itself. Currently, Apache
> >>>>> Airflow automatically clears a task instance's XCom data when it is
> >>>>> retried. This behavior, while ensuring a clean state for retry attempts,
> >>>>> creates limitations:
> >>>>>
> >>>>>   - Loss of Internal Progress: Tasks that have internal checkpointing
> >>>>>   or progress tracking lose all intermediate state on retry, forcing a
> >>>>>   restart from the beginning.
> >>>>>   - Resource State Loss: Tasks cannot maintain state about allocated
> >>>>>   resources (compute instances, downstream job IDs, etc.) across retry
> >>>>>   attempts, leading to redundant, expensive setup operations.
> >>>>>   - No Recovery/Resume Capability: There's no way for tasks to resume
> >>>>>   from internal checkpoints when transient failures occur during
> >>>>>   long-running atomic operations.
> >>>>>   - Poor User Experience: Users must implement external state management
> >>>>>   systems to work around this limitation, adding complexity to DAG
> >>>>>   authoring.
> >>>>>
> >>>>> This proposal aims to extend the capacity of XCom by allowing a Task
> >>>>> Instance's XCom to persist through its retries, enabling users to build
> >>>>> more resilient and efficient pipelines. This is particularly useful for
> >>>>> tasks which are atomic (so one such task cannot be split into multiple
> >>>>> tasks) and need to manage internal state or checkpoints.
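> >>>>>
> >>>>> For illustration, a checkpointing sketch with the proposed flag (the
> >>>>> dataset constants and process_batch are hypothetical):
> >>>>>
> >>>>>      from airflow.decorators import task
> >>>>>
> >>>>>      @task(retries=3, persist_xcom_through_retry=True)
> >>>>>      def process_large_dataset(ti=None):
> >>>>>          # With the proposed flag, this XCom survives retries instead of
> >>>>>          # being cleared, so a retry resumes from the last checkpoint.
> >>>>>          start = ti.xcom_pull(task_ids=ti.task_id, key="last_offset") or 0
> >>>>>          for offset in range(start, TOTAL_RECORDS, BATCH_SIZE):
> >>>>>              process_batch(offset, BATCH_SIZE)  # may fail transiently
> >>>>>              ti.xcom_push(key="last_offset", value=offset + BATCH_SIZE)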
> >>>>>
> >>>>>
> >>>>> We look forward to your feedback and thoughts. Thanks!
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> XD
> >>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
