Hi all,

We currently run Airflow as a Deployment in a Kubernetes cluster. We also use a variant of KubernetesOperator to run our DAGs.
We are investigating how best to make Airflow fault-tolerant, in part because we are looking into using preemptible VMs [1].

*Has there been much discussion about how to deploy Airflow in a fault-tolerant way? Are there any best practices? Ideally, we'd like our Kubernetes-hosted Airflow to support rolling updates for Docker image updates and also to recover from components (worker, scheduler, web) going down temporarily, including when DAGs are in flight.*

Any advice, ideas, and/or feedback appreciated!

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms