This seems super useful to have. We would like to use it to autoscale our
infrastructure.

I've been eyeing this JIRA ticket for this feature:
https://issues.apache.org/jira/browse/AIRFLOW-1123

Has anyone else investigated this or know if this is on the roadmap?

It seems like this could be handled in
models.TaskInstance._run_raw_task.signal_handler by raising a different
exception than AirflowException (which models.TaskInstance.handle_failure
treats as a failure), so we don't record the task as failed. The process
could then return a specific exit code that the CeleryExecutor task can look
for to automatically call retry. Am I missing anything else here?




On Tue, Feb 13, 2018 at 3:28 PM, Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> Hi Kevin,
>
> I can answer in more detail in about 1 hour but I'll give a few quick
> points:
>
> 1. We decided that jobs are pretty risky when it comes to Airflow
> deployments (imagine a pod that launches a Spark job retrying infinitely).
> Airflow, however, already allows you to define a retry policy.
>
>
> 2. We specifically attempted to prevent DAG failures due to pods dying, but
> I think we didn't account for pods dying mid-task (or just assumed people
> would restart). I think this can be a PR against the executor.
>
> On Tue, Feb 13, 2018, 10:53 AM Kevin Lam <ke...@fathomhealth.co> wrote:
>
> > Hi all,
> >
> > My team and I have been experimenting with Airflow and Kubernetes, and
> > there has been a lot of activity recently with the Kubernetes Executor so
> > hopefully someone can help us out.
> >
> > Specifically, we are using our own variant of the Kubernetes executor to
> > run some pods on pre-emptible VMs on GKE (
> > https://cloud.google.com/kubernetes-engine/docs/concepts/preemptible-vm
> ),
> > and were wondering if anyone had any advice on how to handle node
> > pre-emptions gracefully.
> >
> > Currently, if a node gets pre-empted and removed, our pod dies, causing
> > the corresponding Airflow task to fail. In such cases we'd really like
> > the pod to be recreated and the task to continue. At the same time, we
> > want other 'normal' failures to still cause the Airflow task to fail.
> >
> > One idea is to use jobs instead of pods, but if I recall correctly there
> > was already a bunch of discussion on this topic for the apache Kube
> > Executor, and in the end pods were chosen.
> >
> > Does anyone have any ideas about how to work with pre-emptible
> > VMs+GKE+Airflow? Any help is appreciated!
> >
> > Thanks,
> > Kevin
> >
>
