Hi Kevin, I can answer in more detail in about an hour, but here are a few quick points:
1. We decided that Jobs are pretty risky when it comes to Airflow deployments (imagine a pod that launches a Spark job retrying infinitely); however, Airflow lets you define a retry policy on the task itself (a rough sketch follows below the quoted message).
2. We specifically tried to prevent DAG failures due to pods dying, but I don't think we accounted for pods dying mid-task (or we just assumed people would restart). I think this could be a PR against the executor (see the second sketch below).

On Tue, Feb 13, 2018, 10:53 AM Kevin Lam <[email protected]> wrote:

> Hi all,
>
> My team and I have been experimenting with Airflow and Kubernetes, and
> there has been a lot of activity recently with the Kubernetes Executor, so
> hopefully someone can help us out.
>
> Specifically, we are using our own variant of the Kubernetes executor to
> run some pods on pre-emptible VMs on GKE
> (https://cloud.google.com/kubernetes-engine/docs/concepts/preemptible-vm),
> and were wondering if anyone had any advice on how to handle
> pre-emptions of nodes in a graceful way.
>
> Currently, if a node gets pre-empted and removed, our pod dies, causing the
> corresponding Airflow task to fail, but in such cases we'd really like the
> pod to be recreated and the task to continue on. At the same time, we want
> other 'normal' failures to cause the Airflow task to fail.
>
> One idea is to use Jobs instead of Pods, but if I recall correctly there
> was already a bunch of discussion on this topic for the Apache Kube
> Executor, and in the end Pods were chosen.
>
> Does anyone have any ideas about how to work with pre-emptible
> VMs + GKE + Airflow? Any help is appreciated!
>
> Thanks,
> Kevin
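Re point 1, a minimal sketch of what I mean by a retry policy (the DAG and task names here are just placeholders; tune retries/retry_delay to taste):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 2, 1),
    # Retry a bounded number of times with a delay, so a genuinely
    # broken task still fails eventually instead of retrying forever.
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('preemptible_example', default_args=default_args, schedule_interval='@daily')

# Each try is launched as a fresh pod under the Kubernetes executor, so a
# retry after a node pre-emption simply lands on whichever node is available next.
work = BashOperator(
    task_id='do_work',
    bash_command='echo "work that may be interrupted by a pre-emption"',
    dag=dag,
)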

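Re point 2, a very rough sketch of the idea (not the actual executor code; the namespace, label selector, and the requeue hook are just placeholders), using the kubernetes Python client to tell a pod that vanished mid-task apart from a genuine task failure:

from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod,
                      namespace='default',
                      label_selector='airflow-worker'):
    pod = event['object']
    phase = pod.status.phase
    if event['type'] == 'DELETED' and phase not in ('Succeeded', 'Failed'):
        # Pod disappeared before reaching a terminal phase (e.g. its node was
        # pre-empted or drained): requeue the task here instead of failing it.
        print('requeue', pod.metadata.name)
    elif phase == 'Failed':
        # The container itself failed: let the task instance fail as usual.
        print('fail', pod.metadata.name)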