@Grant @Gerard: WRT static workers: I think that the downside of a slightly
longer start-up time is significantly less important than the massive
upside in scalability offered by having non-static workers. This also opens
up really interesting opportunities (e.g. being able to identify how many
resources to give a task, launching tasks with specific dependencies
installed, etc.).

For our current k8s deployment plan we're looking into the following:

1: Launch an NFS cluster with a sidecar that continuously polls
GitHub/Artifactory.
2: If there is any change in GitHub/Artifactory, pull down the latest
version of the code.
3: Every time a worker task starts, have it attach to the NFS cluster as a
volume mount.

This method would work with pretty much any storage system that Kubernetes
allows as a PersistentVolumeClaim (a rough sketch of the worker-side mount
is below).
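
To make step 3 concrete, here's a minimal sketch using the Python
kubernetes client. The claim name, image, namespace, and mount path are
made-up placeholders rather than our actual config, and in practice the
executor would inject the real task command into the container:

    # Sketch: launch a worker pod that mounts an NFS-backed
    # PersistentVolumeClaim as its DAGs folder. All names are placeholders.
    from kubernetes import client, config

    config.load_incluster_config()  # use load_kube_config() outside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="airflow-worker"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="our-registry/airflow-worker:latest",  # placeholder image
                    # the task command would be set here by whatever launches the pod
                    volume_mounts=[
                        client.V1VolumeMount(
                            name="dags",
                            mount_path="/usr/local/airflow/dags",
                            read_only=True,
                        )
                    ],
                )
            ],
            volumes=[
                client.V1Volume(
                    name="dags",
                    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                        claim_name="airflow-dags-nfs"  # claim backed by the NFS cluster
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="airflow", body=pod)

The same pod spec works whether the claim is backed by NFS, EFS, or
anything else Kubernetes can mount, which is the main appeal of this
approach.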

@Jeremiah: Your system actually sounds a lot like ours (except we have
pretty heavy regulations against cloud services, so we're doing our stuff a
lot more bare-metal). git-sync definitely works for a lot of use cases. The
main issue for companies like mine is that there are certain
robustness/availability issues with pulling code straight from GitHub ->
production (e.g. if our GitHub Enterprise instance goes down). I might
speak to the k8s guys about implementing an Artifactory PVC. Until then
we'll probably just create an "artifactory-sync" and have a Jenkins job
that continuously polls GitHub and mirrors the code to Artifactory.
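
To sketch what I mean by "artifactory-sync": a dumb poll loop along these
lines would do it (the URL, paths, and poll interval are made-up
placeholders, and a real version would need auth, error handling, and an
atomic swap of the target directory):

    # Rough sketch of an "artifactory-sync" sidecar: poll an Artifactory
    # repo and unpack the newest DAG bundle into the shared volume.
    import io
    import time
    import zipfile

    import requests

    ARTIFACT_URL = "https://artifactory.example.com/artifactory/dags/dags-latest.zip"
    TARGET_DIR = "/shared/dags"   # the volume the Airflow containers mount
    POLL_INTERVAL = 30            # seconds

    last_etag = None
    while True:
        etag = requests.head(ARTIFACT_URL).headers.get("ETag")
        if etag and etag != last_etag:
            resp = requests.get(ARTIFACT_URL)
            resp.raise_for_status()
            zipfile.ZipFile(io.BytesIO(resp.content)).extractall(TARGET_DIR)
            last_etag = etag
        time.sleep(POLL_INTERVAL)

Jenkins would push new bundles to Artifactory, and the sidecar only ever
reads from Artifactory, so GitHub being down doesn't block running tasks.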

I'm glad to see this topic has sparked so much conversation :).

On Thu, Jul 13, 2017 at 5:24 AM Jeremiah Lowin <[email protected]> wrote:

> p.s. it looks like git-sync has received an "official" release since the
> last time I looked at it: https://github.com/kubernetes/git-sync
>
> On Thu, Jul 13, 2017 at 8:18 AM Jeremiah Lowin <[email protected]> wrote:
>
> > Hi Gerard (and anyone else for whom this might be helpful),
> >
> > We've run Airflow on GCP for a few years. The structure has changed over
> > time but at the moment we use the following basic outline:
> >
> > 1. Build a container that includes all Airflow and DAG dependencies and
> > push it to Google container registry. If you need to add/update
> > dependencies or update airflow.cfg, simply push a new image
> > 2. All DAGs are pushed to a git repo
> > 3. Host the AirflowDB in Google Cloud SQL
> > 4. Create a Kubernetes deployment that runs the following containers:
> > -- Airflow scheduler (using the dependencies image)
> > -- Airflow webserver (using the dependencies image)
> > -- Airflow maintenance (using the dependencies image) - this container
> > does nothing (sleep infinity), but since it shares the same setup as the
> > scheduler/webserver, it's an easy place to `exec` into the cluster to
> > investigate any issues that might be crashing the main containers. We
> > limit its CPU to minimize impact on cluster resources. Hacky but
> > effective.
> > -- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy)
> > - to connect to the Airflow DB
> > -- git-sync (https://github.com/jlowin/git-sync)
> >
> > The last container (git-sync) is a small library I wrote to solve the
> > issue of syncing DAGs. It's not perfect and ***I am NOT offering any
> > support for it*** but it gets the job done. It's meant to be a sidecar
> > container and does one thing: constantly fetch a git repo to a local
> > folder. In your deployment, create an EmptyDir volume and mount it in all
> > containers (except cloud sql). Git-sync should use that volume as its
> > target, and scheduler/webserver should use the volume as the DAGs folder.
> > That way, every 30 seconds, git-sync will fetch the git repo into that
> > volume, and the Airflow containers will immediately see the latest files
> > appear.
> >
> > 5. Create a Kubernetes service to expose the webserver UI
> >
> > Our actual implementation is considerably more complicated than this
> > since we have extensive custom modules that are loaded via git-sync
> > rather than being baked into the image, as well as a few other GCP
> > service integrations, but this overview should point you in the right
> > direction.
> > Getting it running the first time requires a little elbow grease but once
> > built, it's easy to automate the process.
> >
> > Best,
> > Jeremiah
> >
> >
> >
> > On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <[email protected]>
> > wrote:
> >
> >> It would be really good if you'd share experiences on how to run this
> >> on kubernetes and ECS. I'm not aware of a good guide on how to run this
> >> on either of them, for example, but it's a very useful and quick setup
> >> to start with, especially combining that with Deployment Manager and
> >> CloudFormation (probably).
> >>
> >> I'm talking to someone else who's looking at running on kubernetes and
> >> potentially open-sourcing a generic template for kubernetes deployments.
> >>
> >>
> >> Would it be possible to share your experiences?  What tech are you using
> >> for specific issues?
> >>
> >> - how do you deploy and sync dags?  Are you using EFS?
> >> - how do you build the container with airflow + executables?
> >> - where do you send log files or log lines to?
> >> - High Availability and how?
> >>
> >> Really looking forward to how that's done, so we can put this on the
> >> wiki.
> >>
> >> Especially since GCP is now also starting to embrace airflow, it'd be
> >> good to have a better understanding of how easy and quick it can be to
> >> deploy airflow on GCP:
> >>
> >>
> >> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
> >>
> >> Rgds,
> >>
> >> Gerard
> >>
> >>
> >> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <[email protected]>
> >> wrote:
> >>
> >> > for what it's worth we've been running airflow on ECS for a few years
> >> > already.
> >> >
> >> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
> >> > [email protected]> wrote:
> >> >
> >> > > Is having a static set of workers necessary? Launching a job on
> >> > > Kubernetes from a cached docker image takes a few seconds max. I
> >> > > think this is an acceptable delay for a batch processing system like
> >> > > airflow.
> >> > >
> >> > > Additionally, if you dynamically launch workers you can start
> >> > > dynamically launching *any type* of worker and you don't have to
> >> > > statically allocate pools of worker types. E.g. a single DAG could
> >> > > use a Scala docker image to do Spark calculations, a C++ docker
> >> > > image to use some low-level numerical library, and a Python docker
> >> > > image by default to do any generic airflow stuff. Additionally, you
> >> > > can size workers according to their usage. Maybe the Spark driver
> >> > > program only needs a few GBs of RAM but the C++ numerical library
> >> > > needs many hundreds.
> >> > >
> >> > > I agree there is a bit of extra book-keeping that needs to be done,
> >> > > but the tradeoff is an important one to explicitly make.
> >> > >
> >> >
> >>
> >
>
