Hi Daniel! I totally get you.
It's a bit like we are decorating a big house that is already full of
fancy things but a little disorganized. This AIP is like suggesting one
more bouquet of flowers on the dining table: yes, it may smell and look
nice, and it does not seem wrong at all given that we have been putting
flowers here and there all the time. That's why I proceeded to share this
AIP here.

But I totally understand the concern that we may want to unify the
interface/the style/the recommendation/...: a unified decoration style
for this house.

I still think the feature this AIP suggests is necessary. I look forward
to the follow-up discussions to make it happen via a better approach and
the right interface.

Thanks!

XD

On 2025/11/19 22:38:08 Daniel Standish via dev wrote:
> Yeah XD I was just saying that the config feels a little bit like that.
>
> But my main point was to try to suggest that we evaluate this based on the
> interface design, and not whether it allows for non-idempotent tasks
> because:
>
> - Airflow already allows such things
> - and the concept is somewhat fraught anyway
> - and I don't think Airflow should be so opinionated about it to go out
> of its way to forbid features needed for real world workflow
>
> Just wanted to suggest we focus on, is this the right interface, or a good
> interface, or a good enough interface, etc.
>
> Hopefully that did not get lost in my other comments.
>
> Interface design is hard.
>
> On Wed, Nov 19, 2025 at 1:38 PM Xiaodong Deng <[email protected]> wrote:
>
> > Hi folks,
> >
> > Thanks a lot for all the valuable inputs.
> >
> > Regarding what @Daniel mentioned, I get what you shared. It's just similar
> > to the concern regarding Configurations: new minor varieties keep stemming
> > from the main ones (like the `Pool.include_deferred` example you
> > mentioned). I fully get you. Does this also mean we should start to have a
> > concrete, detailed guideline for how we should consider new features?
> > That's possibly worth considering.
> >
> > Regarding what @TP shared:
> > - For use case "External Job Tracking and Polling", yes, the intuition
> > would be an operator + a sensor. In the Confluence doc, we had a line to
> > explain "why not separate job Triggering and Polling into two steps". May
> > or may not be a solid reason.
> > - For other points, you mentioned the better option may be "making it a
> > separate task". That applies most of the time, I agree, while there can be
> > exceptions (that's why most of us are agreeing State can be a useful
> > feature here, even if we may be proposing different approaches).
> > - In the end I feel there seems to exist a recommended way/philosophy of
> > using Airflow ("flexibility" vs. "recommended practice" vs. ...), while
> > it's not clearly summarized anywhere. That's possibly another thing worth
> > considering.
> >
> > Given all the valuable inputs from the folks, I will withdraw this
> > proposal for now. I'm happy to discuss with the folks on the alternative
> > approach.
> >
> > Thanks again!
> >
> > Regards,
> > XD
> >
> > On 2025/11/19 06:28:49 Jens Scheffler wrote:
> > > Hi,
> > >
> > > would add (6) as use case as I made it in the Confluence as comment and
> > > TP highlighted: Add try number and keep history for seeing differences
> > > between runs (as admin for sanity check/history after dag code was e.g.
> > > patched - might be a downstream task was not re-run and was depending on
> > > an older XCom ...) so that would help in case of troubleshooting.
> > >
> > > But in (6) NOT as to have logic based on try_number as this would be
> > > another purpose in my view.
> > >
> > > So in this case I think the discussion is valuable and some extension in
> > > all the listed use cases makes sense to me!
> > >
> > > Jens
> > >
> > > On 11/19/25 07:07, Tzu-ping Chung via dev wrote:
> > > > What I feel is, while it is fine to have more than one way to do a
> > > > thing, some of the examples do not sufficiently discuss why existing
> > > > features are not suitable for the use case. This context is important
> > > > since it would affect how we implement the new feature to sufficiently
> > > > distinguish it from existing ones, so it is easier to make the correct
> > > > decision when you are choosing between features to achieve a goal. It
> > > > is also a good chance for us to take a look at enhancing other existing
> > > > features so they cover more use cases and work better together.
> > > >
> > > > I’ll try to break down each use case in the appendix. To be clear, I
> > > > can think of some possibilities for each case why a new feature is
> > > > preferred, but the problem is the document should sufficiently explore
> > > > and discuss existing solutions.
> > > >
> > > > 1: Large Dataset Processing with Checkpoints
> > > >
> > > > It is unclear from the example how the use case cannot be satisfied by
> > > > dynamic task mapping:
> > > >
> > > > @task
> > > > def process_record(record): ...
> > > >
> > > > @task(trigger_rule="always")
> > > > def summary(results): ...
> > > >
> > > > results = process_record.expand(record=get_records_to_process())
> > > > summary(results)
> > > >
> > > > 2: External Job Tracking and Polling
> > > >
> > > > This looks like a use case for sensors to me.
> > > >
> > > > 3: More Efficient API Integration
> > > >
> > > > Why does make_api_calls need to be in the same task? All existing
> > > > patterns in Airflow point to making it a separate task.
> > > >
> > > > 4: Resource Management and Cleanup
> > > >
> > > > Isn’t this what teardown tasks are for?
> > > >
> > > > 5: Adaptive Processing with Learning
> > > >
> > > > This is the use case that I feel the proposal is most useful for.
> > > > However, it can also be satisfied by Variable, or the state persistence
> > > > mechanism mentioned by Ash.
> > > >
> > > > In some ways, the three are really the same thing (a way to keep
> > > > context), except they have different scopes. Variable has the global
> > > > (to the Airflow instance) scope, XCom the task runner process scope
> > > > (almost task instance scope but not quite since it’s cleared for
> > > > retry). StateVariable is also global as currently proposed, but from
> > > > the listed use cases, it is arguably more suitable to be task- or
> > > > dag-scoped (not to be confused with being scoped to a task instance or
> > > > dag run).
> > > >
> > > > Back to the proposal at hand, the way I understand
> > > > persist_xcom_through_retry is it essentially switches all XComs pushed
> > > > in the task from being scoped by the task instance *try* to the task
> > > > instance (across all tries). I think the idea itself is worth having,
> > > > and having a task-level flag may be a good way to expose it to users.
> > > > However, I feel there are some choices we can still discuss on what the
> > > > feature actually means beyond having a flag that does one specific
> > > > thing internally.
> > > >
> > > > For example, perhaps we should remodel XComModel to include a
> > > > try_number, and allow it to be scoped against either a ti or a ti try?
> > > > Potentially even more choices such as task-scoped across runs, or
> > > > globally by unifying Variable? There are many open questions from my
> > > > point of view, and again, I feel the proposal document should discuss
> > > > the use cases in more detail to pin down the specifics, instead of
> > > > leaving things open to interpretation.
> > > >
> > > > TP
> > > >
> > > >
> > > >> On 19 Nov 2025, at 06:20, Xiaodong Deng <[email protected]> wrote:
> > > >>
> > > >> Thanks for your valuable feedback, folks.
> > > >>
> > > >> Hi @TP,
> > > >>
> > > >> There are cases where breaking down to multiple tasks is not feasible
> > > >> or not the best option.
> > > >> For example, use case 1, which I shared in the Confluence doc
> > > >> appendix.
> > > >>
> > > >> There are also examples where splitting into multiple tasks may seem
> > > >> to make sense but may have downsides. In use cases 2 and 4 in the
> > > >> Confluence doc appendix, I shared why we do it in a single task
> > > >> instead of splitting them into two tasks.
> > > >>
> > > >> Some tasks are simply atomic.
> > > >>
> > > >>
> > > >> Hi @Jarek,
> > > >>
> > > >> I'm glad we are talking about idempotency. That's exactly why
> > > >> sometimes we cannot break down some tasks. In the "Problem Examples"
> > > >> section in the Confluence doc, I covered that to some extent.
> > > >>
> > > >> Would love to discuss more on this, or learn from you about any
> > > >> alternative solutions which can become available to Airflow users in a
> > > >> timely manner.
> > > >>
> > > >> Many thanks!
> > > >>
> > > >>
> > > >> Regards,
> > > >> XD
> > > >>
> > > >> On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > > >>> I agree with TP wholeheartedly. The basic reason why XCom is deleted
> > > >>> when restarting is to maintain idempotency principles. And if we
> > > >>> allow XCom to be used to break idempotency (that's basically what
> > > >>> state per task is about) - then XCom will stop serving its purpose.
> > > >>>
> > > >>> And of course - we are in the new "world" where we are not only
> > > >>> supporting idempotent tasks. Various optimisations and different
> > > >>> kinds of workloads require breaking the "old" idempotency rules we
> > > >>> used to have when Airflow was used mainly for ETL. And deletion of
> > > >>> XCom state was also questioned back then because people **wanted** to
> > > >>> use XCom in other ways. But we held strongly and I think that was a
> > > >>> good choice.
> > > >>>
> > > >>> And while repurposing XCom to do "something" else might seem like a
> > > >>> good idea (even for Apple, because they could internally agree to
> > > >>> some convention and use it as a "solution"), when you look at Airflow
> > > >>> as a product, repurposing XCom to also do something else (i.e.
> > > >>> storing state) seems a bit "lazy" and "short-cut-y".
> > > >>>
> > > >>> What does it save if you do it this way? A few things:
> > > >>>
> > > >>> * not having to do database migration to implement new feature
> > > >>> * avoiding having a clearly defined API where state can be stored for
> > > >>> various purposes on different levels (Task Instance, Task, Task Group
> > > >>> maybe, Dag, Team eventually)
> > > >>> * avoiding to think and prepare for all the various use cases that
> > > >>> people really would like to use it
> > > >>> * avoiding to write the use-case documentation explaining how you can
> > > >>> use state
> > > >>> * avoiding to write all the test cases making sure that all those use
> > > >>> cases are served
> > > >>> * not thinking too much about performance and security implications
> > > >>> of those ("XCom has it already sorted out, I am sure it's going to be
> > > >>> fine")
> > > >>>
> > > >>> Yes, it can be done way faster this way. And I understand some
> > > >>> commercial users could have chosen this way as a shortcut to handle a
> > > >>> specific use case they had in mind. This is absolutely
> > > >>> understandable, and this is what I would even expect a for-profit
> > > >>> company to do to increase so-called "time-to-market" and start
> > > >>> reaping the benefits of it faster.
> > > >>>
> > > >>> But should we do it in Airflow the same way? We are not a for-profit
> > > >>> company, time-to-market of such a feature is secondary, compared to
> > > >>> the stability, maintainability and having a "product" vision.
> > > >>> I consider all the above points as absolutely crucial properties of a
> > > >>> "product" - which Airflow is. They might not be needed in a
> > > >>> "solution", but having a good "product" absolutely requires all those
> > > >>> things.
> > > >>>
> > > >>> When we switched to Airflow 3, one of the ideas was to remove all the
> > > >>> bad "solution-y" decisions we made in the past that slowed us down in
> > > >>> general and - more importantly - turned us (as Daniel used to say)
> > > >>> into "back-compatibility engineers".
> > > >>>
> > > >>> Does it mean it will take longer and require more dedication and
> > > >>> effort and discussions to agree on the scope? Absolutely. Is this a
> > > >>> bad thing? I don't think so.
> > > >>>
> > > >>> J.
> > > >>>
> > > >>>
> > > >>> On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev
> > > >>> <[email protected]> wrote:
> > > >>>
> > > >>>> What is the motivation behind storing internal state in a task,
> > > >>>> instead of splitting the logic on state boundaries into multiple
> > > >>>> tasks? That’s what the task abstraction is meant for, and you
> > > >>>> wouldn’t need a separate mechanism for that; regular XCom would
> > > >>>> just work.
> > > >>>>
> > > >>>> While storing state is a legitimate use case, I feel this particular
> > > >>>> idea would have a net negative impact, encouraging people to do too
> > > >>>> many things in one task. I’d even argue the examples given in the
> > > >>>> Confluence document are already so.
> > > >>>>
> > > >>>> TP
> > > >>>>
> > > >>>>
> > > >>>>> On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > > >>>>>
> > > >>>>> Hi folks!
> > > >>>>>
> > > >>>>> We would like to propose a new feature in Airflow, a boolean
> > > >>>>> parameter "persist_xcom_through_retry" in all Airflow Operators.
> > > >>>>> Our team added this feature in our internal fork a few years back,
> > > >>>>> and it has been benefiting our users extensively.
> > > >>>>>
> > > >>>>> *I have created an AIP at
> > > >>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333*.
> > > >>>>> Below is a summary (in the complete AIP, we have a more detailed
> > > >>>>> problem statement and quite a few interesting use-case examples):
> > > >>>>>
> > > >>>>> *Traditionally, XCom is defined as “a mechanism that lets Tasks talk
> > > >>>>> to each other”. However, XCom also has the capacity and potential to
> > > >>>>> help persist and manage task state within a task itself. Currently,
> > > >>>>> Apache Airflow automatically clears a task instance’s XCom data when
> > > >>>>> it is retried. This behavior, while ensuring clean state for retry
> > > >>>>> attempts, creates limitations:*
> > > >>>>>
> > > >>>>> - *Loss of Internal Progress: Tasks that have internal checkpointing
> > > >>>>> or progress tracking lose all intermediate state on retry, forcing
> > > >>>>> restart from the beginning.*
> > > >>>>> - *Resource State Loss: Tasks cannot maintain state about allocated
> > > >>>>> resources (compute instances, downstream job IDs, etc.)
> > > >>>>> across retry attempts, leading to redundant expensive setup
> > > >>>>> operations.*
> > > >>>>> - *No Recovery/Resume Capability: There's no way for tasks to resume
> > > >>>>> from internal checkpoints when transient failures occur during
> > > >>>>> long-running atomic operations.*
> > > >>>>> - *Poor User Experience: users must implement external state
> > > >>>>> management systems to work around this limitation, adding complexity
> > > >>>>> to DAG authoring.*
> > > >>>>>
> > > >>>>> *This proposal aims to extend the capacity of XCom by allowing a
> > > >>>>> Task Instance’s XCom to persist through its retries, enabling users
> > > >>>>> to build more resilient and efficient pipelines. This is
> > > >>>>> particularly useful for the type of tasks which are atomic (so one
> > > >>>>> such task cannot be split into multiple tasks) and need to manage
> > > >>>>> internal state or checkpoints.*
> > > >>>>>
> > > >>>>>
> > > >>>>> We look forward to your feedback and thoughts. Thanks!
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>>
> > > >>>>> XD
> > > >>>>
> > > >>>> ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: [email protected]
> > > >>>> For additional commands, e-mail: [email protected]
