@Grant @Gerard: WRT static workers: I think the downside of a slightly longer start-up time matters far less than the massive upside in scalability you get from non-static workers. This also opens up really interesting opportunities (e.g. deciding how many resources to give a task, launching tasks with specific dependencies installed, etc.).
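To make that concrete, here is a rough, untested sketch (using the Kubernetes Python client; the image names, resource numbers, and namespace are all made up, not part of any Airflow API) of what launching a one-off worker pod with its own image and resource requests could look like:

from kubernetes import client, config

def launch_task_pod(task_id, image, command, cpu="500m", memory="1Gi"):
    """Launch a single task as its own pod, sized for that task.

    All names and values here are illustrative assumptions.
    """
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"airflow-task-{task_id}",
                                     labels={"app": "airflow-worker"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="task",
                    image=image,  # e.g. a scala/spark or C++ image, chosen per task
                    command=command,
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": cpu, "memory": memory},
                        limits={"cpu": cpu, "memory": memory},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="airflow", body=pod)

if __name__ == "__main__":
    # hypothetical usage: a memory-hungry task gets a bigger pod
    launch_task_pod("numerics-1", "registry.example.com/cpp-numerics:latest",
                    ["/bin/run_job"], cpu="2", memory="8Gi")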
For our current plan for a k8s deployment we're looking into the following:

1. Launch an NFS cluster with a side-car that continuously polls github/artifactory.
2. If there is any change in github/artifactory, pull down the latest version of the code.
3. Every time a worker task starts, have it attach to the NFS cluster as a volume mount (a rough sketch of this step follows below).

This method would work with pretty much any system that Kubernetes allows as a persistentVolumeClaim.

@Jeremiah: Your system actually sounds a lot like ours (except we have pretty heavy regulations against cloud services, so we're doing our stuff a lot more bare-metal). git-sync definitely works for a lot of use cases. The main issue for companies like mine is that there are certain robustness/availability concerns with pulling code straight from github -> production (e.g. if our git enterprise goes down). I might speak to the k8s guys about implementing an artifactory PVC. Until then we'll probably just create an "artifactory-sync" and have a jenkins job that continuously polls github and mirrors the code to artifactory.

I'm glad to see this topic has sparked so much conversation :).
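A minimal sketch of step 3, again with the Kubernetes Python client. The PVC name "airflow-dags-nfs", the mount path, and the image are assumptions for illustration, not details of an existing setup:

from kubernetes import client

def worker_pod_with_dags_volume(task_id, image, command,
                                claim_name="airflow-dags-nfs",   # assumed PVC name
                                dags_path="/usr/local/airflow/dags"):
    """Build a worker pod spec that mounts the NFS-backed PVC read-only as the DAGs folder."""
    dags_volume = client.V1Volume(
        name="dags",
        persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
            claim_name=claim_name),
    )
    container = client.V1Container(
        name="task",
        image=image,
        command=command,
        volume_mounts=[client.V1VolumeMount(name="dags",
                                            mount_path=dags_path,
                                            read_only=True)],
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"airflow-task-{task_id}"),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container],
                              volumes=[dags_volume]),
    )

Because the worker only sees a persistentVolumeClaim, swapping NFS for any other PVC-backed storage only changes the PV/PVC definitions, not the worker pod spec.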
On Thu, Jul 13, 2017 at 5:24 AM Jeremiah Lowin <[email protected]> wrote:

> p.s. it looks like git-sync has received an "official" release since the
> last time I looked at it: https://github.com/kubernetes/git-sync
>
> On Thu, Jul 13, 2017 at 8:18 AM Jeremiah Lowin <[email protected]> wrote:
>
> > Hi Gerard (and anyone else for whom this might be helpful),
> >
> > We've run Airflow on GCP for a few years. The structure has changed over
> > time but at the moment we use the following basic outline:
> >
> > 1. Build a container that includes all Airflow and DAG dependencies and
> > push it to Google container registry. If you need to add/update
> > dependencies or update airflow.cfg, simply push a new image.
> > 2. All DAGs are pushed to a git repo.
> > 3. Host the AirflowDB in Google Cloud SQL.
> > 4. Create a Kubernetes deployment that runs the following containers:
> > -- Airflow scheduler (using the dependencies image)
> > -- Airflow webserver (using the dependencies image)
> > -- Airflow maintenance (using the dependencies image) - this container
> > does nothing (sleep infinity) but since it shares the same setup as the
> > scheduler/webserver, it's an easy place to `exec` into the cluster to
> > investigate any issues that might be crashing the main containers. We
> > limit its CPU to minimize impact on cluster resources. Hacky but effective.
> > -- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy)
> > - to connect to the Airflow DB
> > -- git-sync (https://github.com/jlowin/git-sync)
> >
> > The last container (git-sync) is a small library I wrote to solve the
> > issue of syncing DAGs. It's not perfect and ***I am NOT offering any
> > support for it*** but it gets the job done. It's meant to be a sidecar
> > container and does one thing: constantly fetch a git repo to a local
> > folder. In your deployment, create an EmptyDir volume and mount it in all
> > containers (except cloud sql). Git-sync should use that volume as its
> > target, and scheduler/webserver should use the volume as the DAGs folder.
> > That way, every 30 seconds, git-sync will fetch the git repo into that
> > volume, and the Airflow containers will immediately see the latest files
> > appear.
> >
> > 5. Create a Kubernetes service to expose the webserver UI.
> >
> > Our actual implementation is considerably more complicated than this
> > since we have extensive custom modules that are loaded via git-sync rather
> > than being baked into the image, as well as a few other GCP service
> > integrations, but this overview should point in the right direction.
> > Getting it running the first time requires a little elbow grease but once
> > built, it's easy to automate the process.
> >
> > Best,
> > Jeremiah
> >
> >
> > On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <[email protected]>
> > wrote:
> >
> >> It would be really good if you'd share experiences on how to run this on
> >> kubernetes and ECS.
> >> I'm not aware of a good guide on how to run this on either, for example,
> >> but it's a very useful and quick setup to start with, especially combining
> >> that with deployment manager and cloudformation (probably).
> >>
> >> I'm talking to someone else who's looking at running on kubernetes and
> >> potentially open-sourcing a generic template for kubernetes deployments.
> >>
> >> Would it be possible to share your experiences? What tech are you using
> >> for specific issues?
> >>
> >> - how do you deploy and sync dags? Are you using EFS?
> >> - how do you build the container with airflow + executables?
> >> - where do you send log files or log lines to?
> >> - High Availability and how?
> >>
> >> Really looking forward to how that's done, so we can put this on the wiki.
> >>
> >> Especially since GCP is now also starting to embrace airflow, it'd be good
> >> to have a better understanding of how easy and quick it can be to deploy
> >> airflow on gcp:
> >>
> >> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
> >>
> >> Rgds,
> >>
> >> Gerard
> >>
> >>
> >> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <[email protected]>
> >> wrote:
> >>
> >> > for what it's worth we've been running airflow on ECS for a few years
> >> > already.
> >> >
> >> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
> >> > [email protected]> wrote:
> >> >
> >> > > Is having a static set of workers necessary? Launching a job on
> >> > > Kubernetes from a cached docker image takes a few seconds max. I think
> >> > > this is an acceptable delay for a batch processing system like airflow.
> >> > >
> >> > > Additionally, if you dynamically launch workers you can start
> >> > > dynamically launching *any type* of worker and you don't have to
> >> > > statically allocate pools of worker types. IE) A single DAG could use a
> >> > > scala docker image to do spark calculations, a C++ docker image to use
> >> > > some low level numerical library, and a python docker image by default
> >> > > to do any generic airflow stuff. Additionally, you can size workers
> >> > > according to their usage. Maybe the spark driver program only needs a
> >> > > few GBs of RAM but the C++ numerical library needs many hundreds.
> >> > >
> >> > > I agree there is a bit of extra book-keeping that needs to be done, but
> >> > > the tradeoff is an important one to explicitly make.
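For anyone who wants to try the git-sync sidecar layout Jeremiah describes above, here is a rough sketch of just the scheduler + git-sync part of such a deployment, once more via the Kubernetes Python client. The image names, repo URL, paths, and the git-sync environment variable names are assumptions (variable names differ between git-sync versions, so check its README), so treat this as a starting point rather than a working manifest:

from kubernetes import client, config

DAGS_DIR = "/usr/local/airflow/dags"   # assumed path; match your airflow.cfg

# Shared scratch volume: git-sync writes into it, the scheduler reads DAGs from it.
dags_volume = client.V1Volume(name="dags", empty_dir=client.V1EmptyDirVolumeSource())
dags_mount = client.V1VolumeMount(name="dags", mount_path=DAGS_DIR)

scheduler = client.V1Container(
    name="scheduler",
    image="registry.example.com/airflow-deps:latest",  # the "dependencies image"
    command=["airflow", "scheduler"],
    volume_mounts=[dags_mount],
)

git_sync = client.V1Container(
    name="git-sync",
    image="registry.example.com/git-sync:latest",  # placeholder image
    env=[
        # Assumed variable names; verify against the git-sync README for your version.
        client.V1EnvVar(name="GIT_SYNC_REPO", value="https://github.com/example/dags.git"),
        client.V1EnvVar(name="GIT_SYNC_DEST", value=DAGS_DIR),
    ],
    volume_mounts=[dags_mount],
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="airflow-scheduler"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "airflow-scheduler"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "airflow-scheduler"}),
            spec=client.V1PodSpec(containers=[scheduler, git_sync],
                                  volumes=[dags_volume]),
        ),
    ),
)

if __name__ == "__main__":
    config.load_kube_config()
    client.AppsV1Api().create_namespaced_deployment(namespace="airflow", body=deployment)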
