Hi Gerard (and anyone else for whom this might be helpful),

We've run Airflow on GCP for a few years. The structure has changed over time, but at the moment we use the following basic outline:
1. Build a container that includes all Airflow and DAG dependencies and push it to Google Container Registry. If you need to add/update dependencies or update airflow.cfg, simply push a new image.

2. All DAGs are pushed to a git repo.

3. Host the Airflow DB in Google Cloud SQL.

4. Create a Kubernetes deployment (see the sketch below) that runs the following containers:
-- Airflow scheduler (using the dependencies image)
-- Airflow webserver (using the dependencies image)
-- Airflow maintenance (using the dependencies image) - this container does nothing (sleep infinity), but since it shares the same setup as the scheduler/webserver, it's an easy place to `exec` into the cluster to investigate any issues that might be crashing the main containers. We limit its CPU to minimize impact on cluster resources. Hacky but effective.
-- Cloud SQL Proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy) - to connect to the Airflow DB
-- git-sync (https://github.com/jlowin/git-sync)

The last container (git-sync) is a small library I wrote to solve the issue of syncing DAGs. It's not perfect and ***I am NOT offering any support for it*** but it gets the job done. It's meant to be a sidecar container and does one thing: constantly fetch a git repo to a local folder. In your deployment, create an EmptyDir volume and mount it in all containers (except Cloud SQL Proxy). git-sync should use that volume as its target, and the scheduler/webserver should use the volume as the DAGs folder. That way, every 30 seconds, git-sync will fetch the git repo into that volume, and the Airflow containers will immediately see the latest files appear.

5. Create a Kubernetes service to expose the webserver UI.

Our actual implementation is considerably more complicated than this since we have extensive custom modules that are loaded via git-sync rather than being baked into the image, as well as a few other GCP service integrations, but this overview should point you in the right direction. Getting it running the first time requires a little elbow grease, but once built, it's easy to automate the process.
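To make the shape of steps 4 and 5 concrete, here is a minimal sketch -- not our actual manifests. Every name, image path, port, mount path, and resource figure below is a placeholder, and the git-sync environment variables follow the common convention, so check the repo's README for the exact names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow
  template:
    metadata:
      labels:
        app: airflow
    spec:
      volumes:
        - name: dags
          emptyDir: {}                      # shared DAGs folder, filled by git-sync
      containers:
        - name: scheduler
          image: gcr.io/YOUR_PROJECT/airflow:latest      # the dependencies image
          command: ["airflow", "scheduler"]              # assumes airflow is on the image's PATH
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}     # must match dags_folder in airflow.cfg
        - name: webserver
          image: gcr.io/YOUR_PROJECT/airflow:latest
          command: ["airflow", "webserver"]
          ports:
            - containerPort: 8080
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}
        - name: maintenance
          image: gcr.io/YOUR_PROJECT/airflow:latest
          command: ["sleep", "infinity"]                 # does nothing; exec target for debugging
          resources:
            limits: {cpu: 50m}                           # keep its footprint small
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.16   # pin whatever proxy version you use
          command: ["/cloud_sql_proxy",
                    "-instances=YOUR_PROJECT:REGION:INSTANCE=tcp:5432"]
        - name: git-sync
          image: YOUR_REGISTRY/git-sync:latest           # the sidecar described above
          env:
            - {name: GIT_SYNC_REPO, value: "https://github.com/YOUR_ORG/your-dags-repo.git"}
            - {name: GIT_SYNC_DEST, value: "/airflow/dags"}
            - {name: GIT_SYNC_WAIT, value: "30"}         # fetch every 30 seconds
          volumeMounts:
            - {name: dags, mountPath: /airflow/dags}
---
apiVersion: v1
kind: Service
metadata:
  name: airflow-web
spec:
  type: LoadBalancer          # or ClusterIP behind an Ingress
  selector:
    app: airflow
  ports:
    - port: 80
      targetPort: 8080        # Airflow webserver default port
```

In a real deployment you'd also want the DB connection string in a Secret, resource requests on the Airflow containers, and liveness/readiness probes on the webserver.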
Best,
Jeremiah

On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <[email protected]> wrote:

> It would be really good if you'd share experiences on how to run this on
> kubernetes and ECS. I'm not aware of a good guide on how to run this on
> either, for example, but it's a very useful and quick setup to start with,
> especially combining that with deployment manager and cloudformation
> (probably).
>
> I'm talking to someone else who's looking at running on kubernetes and
> potentially open sourcing a generic template for kubernetes deployments.
>
> Would it be possible to share your experiences? What tech are you using
> for specific issues?
>
> - how do you deploy and sync dags? Are you using EFS?
> - how do you build the container with airflow + executables?
> - where do you send log files or log lines to?
> - High Availability and how?
>
> Really looking forward to how that's done, so we can put this on the wiki.
>
> Especially since GCP is now also starting to embrace airflow, it'd be good
> to have a better understanding of how easy and quick it can be to deploy
> airflow on gcp:
>
> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>
> Rgds,
>
> Gerard
>
> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <[email protected]> wrote:
>
> > for what it's worth we've been running airflow on ECS for a few years
> > already.
> >
> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <[email protected]> wrote:
> >
> > > Is having a static set of workers necessary? Launching a job on
> > > Kubernetes from a cached docker image takes a few seconds max. I think
> > > this is an acceptable delay for a batch processing system like airflow.
> > >
> > > Additionally, if you dynamically launch workers you can start
> > > dynamically launching *any type* of worker and you don't have to
> > > statically allocate pools of worker types. IE) A single DAG could use a
> > > scala docker image to do spark calculations, a C++ docker image to use
> > > some low-level numerical library, and a python docker image by default
> > > to do any generic airflow stuff. Additionally, you can size workers
> > > according to their usage. Maybe the spark driver program only needs a
> > > few GBs of RAM but the C++ numerical library needs many hundreds.
> > >
> > > I agree there is a bit of extra book-keeping that needs to be done, but
> > > the tradeoff is an important one to explicitly make.
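(A purely hypothetical sketch of the per-task worker pods described in the quoted message above -- the image names and resource figures are invented for illustration:)

```yaml
# One pod per task, each with its own image and sizing, instead of a static worker pool.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-task
spec:
  restartPolicy: Never
  containers:
    - name: task
      image: YOUR_REGISTRY/spark-driver:latest    # Scala/Spark task image
      resources:
        requests:
          memory: 4Gi                              # the driver only needs a few GB
---
apiVersion: v1
kind: Pod
metadata:
  name: numerics-task
spec:
  restartPolicy: Never
  containers:
    - name: task
      image: YOUR_REGISTRY/cpp-numerics:latest     # C++ numerical-library task image
      resources:
        requests:
          memory: 400Gi                            # "many hundreds" of GB
```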
