Hi Airflow Community,

I'm excited to share two complementary proposals that address critical 
reliability challenges in Airflow, particularly around infrastructure 
disruptions and task resilience. These proposals build on insights from 
managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task 
executions per cluster).

Proposals

1. Infrastructure-Aware Task Execution and Context Propagation

https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M

2. Resumable Operators for Disruption Readiness

https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI

What We're Solving

- Infrastructure failures consume user retries: pod evictions shouldn't count 
against application retry budgets
- Wasted computation: worker crashes shouldn't restart healthy 3-hour Databricks 
jobs from zero

How

- Execution Context: distinguish infrastructure vs. application failures for 
smarter retry handling (rough sketch below)
- Resumable Operators: checkpoint and reconnect to external jobs after 
disruptions, following the existing deferral pattern (second sketch below)

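To make the Execution Context idea concrete, here is a rough sketch. The enum 
and function names are purely illustrative, not current Airflow API; the 
proposal document defines the actual interface. The point is only that the 
scheduler can make a smarter retry decision once the failure source is part of 
the task's execution context:

    from enum import Enum


    class FailureSource(Enum):
        """Hypothetical classification attached to a failed task try."""
        APPLICATION = "application"          # task code raised, or the external job failed
        INFRASTRUCTURE = "infrastructure"    # pod eviction, node drain, worker crash


    def consumes_retry_budget(source: FailureSource) -> bool:
        # Only application failures count against the user's `retries` setting;
        # infrastructure disruptions are retried on a separate, platform-owned budget.
        return source is FailureSource.APPLICATION


    def next_try_number(current_try: int, source: FailureSource) -> int:
        # Sketch: an infrastructure-caused rerun keeps the try number, preserving
        # the user-configured retries for genuine application errors.
        return current_try + 1 if consumes_retry_budget(source) else current_try
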
These approaches have significantly improved reliability and user experience 
and reduced wasted compute costs in our production environment.
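
For the Resumable Operators piece, here is a minimal checkpoint-and-reconnect 
sketch. It deliberately uses typing.Protocol stand-ins rather than real Airflow 
or Databricks APIs; in practice this logic would live in an operator's 
execute/deferral flow as described in the doc:

    from typing import Optional, Protocol


    class ExternalJobClient(Protocol):
        """Stand-in for e.g. a Databricks client; method names are illustrative."""
        def submit_run(self, payload: dict) -> str: ...
        def wait_for_completion(self, run_id: str) -> dict: ...


    class CheckpointStore(Protocol):
        """Stand-in for the durable per-task checkpoint the proposal introduces."""
        def get(self, key: str) -> Optional[str]: ...
        def set(self, key: str, value: str) -> None: ...


    def run_resumable(client: ExternalJobClient, store: CheckpointStore, payload: dict) -> dict:
        # If a checkpointed run_id exists, re-attach to the running external job;
        # otherwise submit a new run and checkpoint its id before waiting. A worker
        # crash or pod eviction mid-wait therefore never restarts a healthy job.
        run_id = store.get("run_id")
        if run_id is None:
            run_id = client.submit_run(payload)
            store.set("run_id", run_id)
        return client.wait_for_completion(run_id)

The shape mirrors the existing deferral pattern: the expensive external work is 
identified by a durable handle, and after a disruption the task's job is to 
find that handle and resume waiting on it rather than redo the work.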

Looking forward to your feedback on both the problems we're addressing and the 
proposed solutions. Both proposals are fully backward compatible and follow 
existing Airflow patterns.

Happy to answer any questions or dive deeper into implementation details. 

Best,

Stefan Wang

