Hi Jens,

Thank you so much for the help and for being so supportive — it’s working for 
me now!

Really appreciate you stepping in.

Best,
Stefan


> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
> 
> As PMC we are space owners; I have added permissions for the user stefwang to 
> the Airflow space. Hope it is working now.
> 
> On 11/30/25 04:54, Stefan Wang wrote:
>> Apologies for the late response, folks; I had on-call shifts. Catching up 
>> here and will respond to each comment in order.
>> 
>> 
>> 
>> —
>> 
>> 
>> 
>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from 
>> Jens Scheffler
>> 
>> Hi Jens,
>> 
>> Thanks for the suggestion! I completely agree that following the formal AIP 
>> process is the right approach.
>> 
>> I've been trying to create the AIPs on the Confluence wiki, but I'm running 
>> into permission issues. When I click the "Create new AIP" button on the AIP 
>> page, I get a "Sorry, you don't have permission to create content" error.
>> 
>> I followed the exact steps listed to create an ASF Confluence account, but 
>> it was not granted EDIT access to the AIRFLOW space. I created two accounts 
>> (stewang and stefwang) to rule out any account-specific issues, but both 
>> have the same problem. I would really appreciate some expertise here: who 
>> should we contact to get the appropriate permissions, or is there a 
>> specific access request process I should follow? Alternatively, if someone 
>> with edit access could copy the Google Doc content into Confluence for 
>> comments, that would help. Thanks a lot!
>> 
>> I’ll try to contact ASF infra support in the meantime, and will work on 
>> migrating the Google Docs to Confluence once I have access.
>> 
>> Thanks, Stefan
>> 
>> 
>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
>>> 
>>> Hi Stefan,
>>> 
>>> Thank you for the work! Very well organised and easy to follow docs.
>>> 
>>> I have been thinking about infrastructure retries for a while now. Also, I
>>> had a few discussions at the Airflow Summit last month and I know that
>>> others are interested as well.
>>> 
>>> It looks to me, too, like this will be split into multiple PRs, but if
>>> there is a code POC, I would like to take a look.
>>> 
>>> Regards,
>>> Christos
>>> 
>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
>>> 
>>>> Also something we discussed off-line: I think the scope of it is quite
>>>> "huge", but there are small and incremental improvements that might not
>>>> even require an AIP and can be implemented as PRs. I think it's great to
>>>> keep the "big hairy vision" in your head (like I did several years ago
>>>> when I proposed a "small" improvement in our dependency management that
>>>> took about 4 years to get to the stage I thought it would reach in a few
>>>> weeks).
>>>> 
>>>> Getting incremental improvements in and showing dedication, merit and a
>>>> consistent pattern of improvements is key to eventually getting big,
>>>> "world-changing" changes accepted.
>>>> 
>>>> J.
>>>> 
>>>> 
>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi Stefan,
>>>>> 
>>>>> thanks for dropping the proposals!
>>>>> 
>>>>> I'd propose storing the documents in cWiki and opening them formally
>>>>> there as AIP proposals, since that follows the AIP process.
>>>>> 
>>>>> See
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>>>> 
>>>>> Jens
>>>>> 
>>>>> On 11/14/25 12:35, Stefan Wang wrote:
>>>>>> Hi Airflow Community,
>>>>>> 
>>>>>> I'm excited to share two complementary proposals that address critical
>>>>>> reliability challenges in Airflow, particularly around infrastructure
>>>>>> disruptions and task resilience. These proposals build on insights from
>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
>>>>>> task executions per cluster).
>>>>>> 
>>>>>> Proposals
>>>>>> 
>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
>>>>>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
>>>>>> 
>>>>>> 2. Resumable Operators for Disruption Readiness
>>>>>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
>>>>>> 
>>>>>> What We're Solving
>>>>>> 
>>>>>> - Infrastructure failures consume user retries - Pod evictions shouldn't
>>>>>>   count against application retry budgets
>>>>>> - Wasted computation - Worker crashes shouldn't restart healthy 3-hour
>>>>>>   Databricks jobs from zero
>>>>>> 
>>>>>> How
>>>>>> 
>>>>>> - Execution Context: Distinguish infrastructure vs application failures
>>>>>>   for smarter retry handling
>>>>>> - Resumable Operators: Checkpoint and reconnect to external jobs after
>>>>>>   disruptions (follows deferral pattern)
>>>>>> 
>>>>>> These approaches have significantly improved reliability and user
>>>>>> experience, and reduced wasted costs in our production environment.
>>>>>> 
>>>>>> Looking forward to your feedback on both the problems we're addressing
>>>>>> and the proposed solutions. Both proposals are fully backward compatible
>>>>>> and follow existing Airflow patterns.
>>>>>> 
>>>>>> Happy to answer any questions or dive deeper into implementation details.
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Stefan Wang
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 
>>>>> 
>> 
> 
