Hi Gerard (and anyone else for whom this might be helpful),

We've run Airflow on GCP for a few years. The structure has changed over time, but at the moment we use the following basic outline:
1. Build a container that includes all Airflow and DAG dependencies and push it to Google Container Registry. If you need to add/update dependencies or update airflow.cfg, simply push a new image.

2. All DAGs are pushed to a git repo.

3. Host the Airflow DB in Google Cloud SQL.

4. Create a Kubernetes deployment (see the sketch below) that runs the following containers:
-- Airflow scheduler (using the dependencies image)
-- Airflow webserver (using the dependencies image)
-- Airflow maintenance (using the dependencies image) - this container does nothing (sleep infinity), but since it shares the same setup as the scheduler/webserver, it's an easy place to `exec` into the cluster to investigate any issues that might be crashing the main containers. We limit its CPU to minimize impact on cluster resources. Hacky but effective.
-- Cloud SQL Proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy) - to connect to the Airflow DB
-- git-sync (https://github.com/jlowin/git-sync)

The last container (git-sync) is a small library I wrote to solve the issue of syncing DAGs. It's not perfect and ***I am NOT offering any support for it*** but it gets the job done. It's meant to be a sidecar container and does one thing: constantly fetch a git repo to a local folder. In your deployment, create an EmptyDir volume and mount it in all containers (except Cloud SQL Proxy). git-sync should use that volume as its target, and the scheduler/webserver should use the volume as the DAGs folder. That way, every 30 seconds, git-sync will fetch the git repo into that volume, and the Airflow containers will immediately see the latest files appear.

5. Create a Kubernetes service to expose the webserver UI.

Our actual implementation is considerably more complicated than this since we have extensive custom modules that are loaded via git-sync rather than being baked into the image, as well as a few other GCP service integrations, but this overview should point you in the right direction. Getting it running the first time requires a little elbow grease, but once built, it's easy to automate the process.
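To make the shape of steps 4 and 5 concrete, here is a minimal sketch -- not our actual manifests. Every name, image path, port, mount path, and resource figure below is a placeholder, and the git-sync environment variables follow the common convention, so check the repo's README for the exact names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow
  template:
    metadata:
      labels:
        app: airflow
    spec:
      volumes:
        - name: dags
          emptyDir: {}                      # shared DAGs folder, filled by git-sync
      containers:
        - name: scheduler
          image: gcr.io/YOUR_PROJECT/airflow:latest      # the dependencies image
          command: ["airflow", "scheduler"]              # assumes airflow is on the image's PATH
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}     # must match dags_folder in airflow.cfg
        - name: webserver
          image: gcr.io/YOUR_PROJECT/airflow:latest
          command: ["airflow", "webserver"]
          ports:
            - containerPort: 8080
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}
        - name: maintenance
          image: gcr.io/YOUR_PROJECT/airflow:latest
          command: ["sleep", "infinity"]                 # does nothing; exec target for debugging
          resources:
            limits: {cpu: 50m}                           # keep its footprint small
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.16   # pin whatever proxy version you use
          command: ["/cloud_sql_proxy",
                    "-instances=YOUR_PROJECT:REGION:INSTANCE=tcp:5432"]
        - name: git-sync
          image: YOUR_REGISTRY/git-sync:latest           # the sidecar described above
          env:
            - {name: GIT_SYNC_REPO, value: "https://github.com/YOUR_ORG/your-dags-repo.git"}
            - {name: GIT_SYNC_DEST, value: "/airflow/dags"}
            - {name: GIT_SYNC_WAIT, value: "30"}         # fetch every 30 seconds
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}
---
apiVersion: v1
kind: Service
metadata:
  name: airflow-web
spec:
  type: LoadBalancer          # or ClusterIP behind an Ingress
  selector:
    app: airflow
  ports:
    - port: 80
      targetPort: 8080        # Airflow webserver default port
```

In a real deployment you'd also want the DB connection string in a Secret, resource requests on the Airflow containers, and liveness/readiness probes on the webserver.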
Best,
Jeremiah

On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <[email protected]> wrote:

> It would be really good if you'd share experiences on how to run this on
> kubernetes and ECS. I'm not aware of a good guide on how to run this on
> either, for example, but it's a very useful and quick setup to start with,
> especially combining that with deployment manager and cloudformation
> (probably).
>
> I'm talking to someone else who's looking at running on kubernetes and
> potentially open sourcing a generic template for kubernetes deployments.
>
> Would it be possible to share your experiences? What tech are you using
> for specific issues?
>
> - how do you deploy and sync dags? Are you using EFS?
> - how do you build the container with airflow + executables?
> - where do you send log files or log lines to?
> - High Availability and how?
>
> Really looking forward to how that's done, so we can put this on the wiki.
>
> Especially since GCP is now also starting to embrace airflow, it'd be good
> to have a better understanding of how easy and quick it can be to deploy
> airflow on gcp:
>
> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>
> Rgds,
>
> Gerard
>
> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <[email protected]> wrote:
>
> > for what it's worth we've been running airflow on ECS for a few years
> > already.
> >
> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <[email protected]> wrote:
> >
> > > Is having a static set of workers necessary? Launching a job on
> > > Kubernetes from a cached docker image takes a few seconds max. I think
> > > this is an acceptable delay for a batch processing system like airflow.
> > >
> > > Additionally, if you dynamically launch workers you can start
> > > dynamically launching *any type* of worker and you don't have to
> > > statically allocate pools of worker types. IE) A single DAG could use a
> > > scala docker image to do spark calculations, a C++ docker image to use
> > > some low-level numerical library, and a python docker image by default
> > > to do any generic airflow stuff. Additionally, you can size workers
> > > according to their usage. Maybe the spark driver program only needs a
> > > few GBs of RAM but the C++ numerical library needs many hundreds.
> > >
> > > I agree there is a bit of extra book-keeping that needs to be done, but
> > > the tradeoff is an important one to explicitly make.
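(A purely hypothetical sketch of the per-task worker pods described in the quoted message above -- the image names and resource figures are invented for illustration:)

```yaml
# One pod per task, each with its own image and sizing, instead of a static worker pool.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-task
spec:
  restartPolicy: Never
  containers:
    - name: task
      image: YOUR_REGISTRY/spark-driver:latest    # Scala/Spark task image
      resources:
        requests:
          memory: 4Gi                              # the driver only needs a few GB
---
apiVersion: v1
kind: Pod
metadata:
  name: numerics-task
spec:
  restartPolicy: Never
  containers:
    - name: task
      image: YOUR_REGISTRY/cpp-numerics:latest     # C++ numerical-library task image
      resources:
        requests:
          memory: 400Gi                            # "many hundreds" of GB
```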
