Hi Airflow Community, I'm excited to share two complementary proposals that address critical reliability challenges in Airflow, particularly around infrastructure disruptions and task resilience. These proposals build on insights from managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task executions per cluster).
Proposals

1. Infrastructure-Aware Task Execution and Context Propagation
   https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
2. Resumable Operators for Disruption Readiness
   https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI

What We're Solving

- Infrastructure failures consume user retries: pod evictions shouldn't count against application retry budgets.
- Wasted computation: worker crashes shouldn't restart healthy 3-hour Databricks jobs from zero.

How

- Execution Context: distinguish infrastructure vs. application failures for smarter retry handling.
- Resumable Operators: checkpoint and reconnect to external jobs after disruptions (follows the existing deferral pattern).

In our production environment, these approaches have significantly improved reliability and user experience and reduced wasted cost.

Looking forward to your feedback, both on the problems we're addressing and on the proposed solutions. Both proposals are fully backward compatible and follow existing Airflow patterns. Happy to answer any questions or dive deeper into implementation details; rough sketches of both ideas are included below to make them concrete.
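To make the first idea concrete, here is a minimal sketch of the failure-classification behaviour. The names below (InfrastructureFailure, classify_failure, should_consume_retry) are hypothetical placeholders rather than the API described in the doc; the point is only that a pod eviction would be retried without consuming the user's retry budget, while an ordinary application error behaves exactly as it does today.

# Illustrative sketch only: these names are placeholders, not the proposed API.

class InfrastructureFailure(Exception):
    """A failure caused by the platform (pod eviction, node loss), not by user code."""


def classify_failure(exc: BaseException) -> str:
    """Crude split between infrastructure and application failures."""
    return "infrastructure" if isinstance(exc, InfrastructureFailure) else "application"


def should_consume_retry(exc: BaseException) -> bool:
    """Only application failures count against the user-configured retry budget."""
    return classify_failure(exc) == "application"


# A pod eviction is retried without decrementing `retries`;
# a failing query still consumes a retry as it does today.
assert should_consume_retry(InfrastructureFailure("pod evicted")) is False
assert should_consume_retry(ValueError("bad query")) is True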
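For the second idea, a minimal sketch of a resumable operator, assuming a hypothetical external-job client and checkpoint helpers (the actual persistence mechanism and trigger machinery are described in the doc, not here). The key behaviour is that a fresh attempt after a worker crash reconnects to the still-running job instead of resubmitting it.

from airflow.models import BaseOperator


class ResumableExternalJobOperator(BaseOperator):
    """Submit an external job once, checkpoint its handle, and reconnect to it
    after a disruption instead of restarting the job from zero."""

    def __init__(self, *, job_spec: dict, **kwargs):
        super().__init__(**kwargs)
        self.job_spec = job_spec

    def execute(self, context):
        # Handle left behind by a previous, disrupted attempt (if any).
        run_id = self._load_checkpoint(context)

        if run_id is None:
            run_id = self._submit(self.job_spec)
            self._save_checkpoint(context, run_id)
            self.log.info("Submitted external job %s", run_id)
        else:
            self.log.info("Reconnecting to in-flight external job %s", run_id)

        # Poll, or defer to a trigger, until the external job finishes.
        return self._wait_until_done(run_id)

    # The helpers below are placeholders for whatever client calls and
    # checkpoint persistence the proposal actually specifies.
    def _load_checkpoint(self, context):
        raise NotImplementedError

    def _save_checkpoint(self, context, run_id):
        raise NotImplementedError

    def _submit(self, job_spec):
        raise NotImplementedError

    def _wait_until_done(self, run_id):
        raise NotImplementedError

This mirrors what deferrable operators already do to free worker slots while waiting; the new piece is that the checkpointed handle also survives losing the worker (or pod) itself.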
Best,
Stefan Wang