Apologies for the late response, folks; I was on on-call shifts. I'm catching 
up here and will respond to each comment in order.



—



Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from Jens 
Scheffler

Hi Jens,

Thanks for the suggestion! I completely agree that following the formal AIP 
process is the right approach.

I've been trying to create the AIPs on the Confluence wiki, but I'm running 
into permission issues. When I click the "Create new AIP" button on the AIP 
page, I get a "Sorry, you don't have permission to create content" error.

I followed the documented steps for creating an ASF Confluence account, but 
the account has no EDIT access under the AIRFLOW space. I created two 
accounts (stewang and stefwang) to rule out any account-specific issue, and 
both have the same problem. Could someone point me to whom we should contact 
to get the appropriate permissions, or is there a specific access request 
process I should follow? Alternatively, if someone with edit access could 
copy the Google Doc content into Confluence for comments, that would be much 
appreciated.

I’ll try to contact ASF infra support in the meantime, and will work on 
migrating the Google Docs to Confluence once I have access.

Thanks, Stefan


> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
> 
> Hi Stefan,
> 
> Thank you for the work! Very well organised and easy to follow docs.
> 
> I have been thinking about infrastructure retries for a while now. Also, I
> had a few discussions at the Airflow Summit last month and I know that
> others are interested as well.
> 
> It looks to me too, that this will be split into multiple PRs but if there
> is a code POC, I would like to take a look.
> 
> Regards,
> Christos
> 
> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
> 
>> Also something we discussed off-line: I think the scope of it is quite
>> "huge" - but there are small and incremental improvements, that might not
>> even require AIP that can be implemented as PRs. I think it's great to
>> keep "big hairy vision" in head (like I did several years ago when I
>> proposed a "small" improvement in our dependency management that took about
>> 4 years to get to the stage I thought it would take a few weeks).
>> 
>> Getting incremental improvements and showing the dedication, merit and
>> consistent pattern of improvements is a key to get - eventually - big and
>> "world-changing" changes.
>> 
>> J.
>> 
>> 
>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
>> wrote:
>> 
>>> Hi Stefan,
>>> 
>>> thanks for dropping the proposals!
>>> 
>>> I'd propose to store the documents in cWiki and open them formally in
>>> there as AIP proposal as then it is following the AIP process.
>>> 
>>> See
>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>> 
>>> Jens
>>> 
>>> On 11/14/25 12:35, Stefan Wang wrote:
>>>> Hi Airflow Community,
>>>> 
>>>> I'm excited to share two complementary proposals that address critical
>>> reliability challenges in Airflow, particularly around infrastructure
>>> disruptions and task resilience. These proposals build on insights from
>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
>> task
>>> executions per cluster).
>>>> 
>>>> Proposals
>>>> 
>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
>>>> 
>>>> 2. Resumable Operators for Disruption Readiness
>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
>>>> 
>>>> What We're Solving
>>>> 
>>>> Infrastructure failures consume user retries - Pod evictions shouldn't
>>> count against application retry budgets
>>>> Wasted computation - Worker crashes shouldn't restart healthy 3-hour
>>> Databricks jobs from zero
>>>> How
>>>> 
>>>> Execution Context: Distinguish infrastructure vs application failures
>>> for smarter retry handling
>>>> Resumable Operators: Checkpoint and reconnect to external jobs after
>>> disruptions (follows deferral pattern)
>>>> These approaches have significantly improved reliability and user
>>> experience, and reduced wasted costs in our production environment.
>>>> 
>>>> Looking forward to your feedback on both the problems we're addressing
>>> and the proposed solutions. Both proposals are fully backward compatible
>>> and follow existing Airflow patterns.
>>>> 
>>>> Happy to answer any questions or dive deeper into implementation
>> details.
>>>> 
>>>> Best,
>>>> 
>>>> Stefan Wang
>>>> 
>>>> 
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
>>> 
>> 
