This seems super useful to have. We would like to use it to autoscale our infrastructure.
I've been eyeing this JIRA ticket for this feature: https://issues.apache.org/jira/browse/AIRFLOW-1123

Has anyone else investigated this, or know if it is on the roadmap? It seems like this could be handled in the signal_handler inside models.TaskInstance._run_raw_task by raising a different exception from AirflowException, so that models.TaskInstance.handle_failure doesn't record the task as failed. Then we can return a specific exit code that the CeleryExecutor task can look for to automatically call retry. Am I missing anything else here?

On Tue, Feb 13, 2018 at 3:28 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:

> Hi Kevin,
>
> I can answer in more detail in about an hour, but I'll give a few quick
> points:
>
> 1. We decided that Jobs are pretty risky when it comes to Airflow
> deployments (imagine a pod that launches a Spark job retrying infinitely);
> however, Airflow allows you to define a retry policy.
>
> 2. We specifically attempted to prevent DAG failures due to pods dying, but
> I think we didn't account for pods dying mid-task (or just assumed people
> would restart). I think this can be a PR against the executor.
>
> On Tue, Feb 13, 2018, 10:53 AM Kevin Lam <ke...@fathomhealth.co> wrote:
>
> > Hi all,
> >
> > My team and I have been experimenting with Airflow and Kubernetes, and
> > there has been a lot of activity recently with the Kubernetes Executor,
> > so hopefully someone can help us out.
> >
> > Specifically, we are using our own variant of the Kubernetes executor to
> > run some pods on preemptible VMs on GKE
> > (https://cloud.google.com/kubernetes-engine/docs/concepts/preemptible-vm),
> > and were wondering if anyone had any advice on how to handle
> > preemptions of nodes in a graceful way.
> >
> > Currently, if a node gets preempted and removed, our pod dies, causing
> > a corresponding Airflow task to fail, but in such cases we'd really like
> > the pod to be recreated and the task to continue on. At the same time,
> > we want other 'normal' failures to cause the Airflow task to fail.
> >
> > One idea is to use Jobs instead of Pods, but if I recall correctly there
> > was already a bunch of discussion on this topic for the Apache Kubernetes
> > Executor, and in the end Pods were chosen.
> >
> > Does anyone have any ideas about how to work with preemptible
> > VMs + GKE + Airflow? Any help is appreciated!
> >
> > Thanks,
> > Kevin
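To make the signal-handler idea concrete, here is a minimal standalone sketch. It is not real Airflow code: `AirflowTaskPreempted`, `TASK_PREEMPTED_EXIT_CODE`, and `run_raw_task` are hypothetical names standing in for the proposed change to `models.TaskInstance._run_raw_task`; the point is that an external SIGTERM raises a distinct exception type (not AirflowException), so the failure path is skipped and the process exits with a sentinel code the executor can translate into a retry.

```python
import signal
import sys

# Assumed sentinel exit code, not a real Airflow constant. The executor
# side would check the child's exit status for this value and call
# retry instead of marking the task failed.
TASK_PREEMPTED_EXIT_CODE = 42


class AirflowTaskPreempted(Exception):
    """Hypothetical exception: the task process was terminated from
    outside (e.g. the node was preempted), distinct from AirflowException
    so handle_failure() is never invoked for it."""


def _signal_handler(signum, frame):
    # Instead of raising AirflowException (which would record the task
    # as failed), raise the dedicated preemption exception.
    raise AirflowTaskPreempted(f"received signal {signum}")


def run_raw_task(task_callable):
    """Minimal stand-in for TaskInstance._run_raw_task."""
    signal.signal(signal.SIGTERM, _signal_handler)
    try:
        task_callable()
    except AirflowTaskPreempted:
        # Skip the normal failure-handling path; exit with the sentinel
        # code so the parent executor process can see the difference
        # between "task logic failed" and "task was preempted".
        sys.exit(TASK_PREEMPTED_EXIT_CODE)
    return 0
```

A 'normal' failure (any other exception) would still propagate and be handled by the existing failure path, which keeps ordinary task errors marked as failed while letting preemptions retry.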