Hello Airflow community,

I'm currently putting Airflow into production at my company of 2000+
people. The most significant sticking point so far is the deployment /
execution model. I wanted to write up my experience so far in this matter
and see how other people are dealing with this issue.

First of all, our goal is to allow engineers to author DAGs and easily
deploy them. That means they should be able to make changes to their DAGs,
add/remove dependencies, and not have to redeploy any of the core
components (scheduler, webserver, workers).

Our first attempt at productionizing Airflow used the vanilla DAGs folder,
bundling all the deps of all the DAGs with the airflow binary itself.
Unfortunately, that meant we had to redeploy the scheduler, webserver
and/or workers every time a dependency changed, which definitely will not
work for us long term.

The next option we considered was the "packaged DAGs" approach, whereby
you place the DAG and its dependencies in a zip file. This would not work
for us, due to the lack of support for dynamic libraries (see
https://airflow.apache.org/concepts.html#packaged-dags).
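
For reference, a packaged DAG is essentially a zip with the DAG module at
the archive root and any pure-Python dependencies vendored next to it.
A rough sketch of how one gets built (file names here are made up):

import zipfile

# The DAG module sits at the root of the zip; pure-Python deps are
# vendored alongside it. Compiled extensions / dynamic libraries inside
# the zip are not supported, which is what ruled this option out for us.
with zipfile.ZipFile("my_dag.zip", "w") as zf:
    zf.write("dags/my_dag.py", arcname="my_dag.py")
    zf.write("vendor/somedep/__init__.py", arcname="somedep/__init__.py")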

We have finally arrived at an option that seems reasonable: a single
Operator that shells out to binary targets we build independently of
Airflow, each bundling its own dependencies.
Configuration is serialized via protobuf and passed over stdin to the
subprocess. The parent process (which is in Airflow's memory space) streams
the logs from stdout and stderr.
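
For concreteness, here is a rough sketch of the pattern (class and field
names are hypothetical, not our actual implementation):

import logging
import subprocess

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class BinaryTargetOperator(BaseOperator):
    """Runs a self-contained binary, passing serialized protobuf config on stdin."""

    @apply_defaults
    def __init__(self, binary_path, task_config_proto, *args, **kwargs):
        super(BinaryTargetOperator, self).__init__(*args, **kwargs)
        self.binary_path = binary_path              # built outside Airflow, deps included
        self.task_config_proto = task_config_proto  # protobuf message describing the task

    def execute(self, context):
        proc = subprocess.Popen(
            [self.binary_path],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merged here for brevity; could be streamed separately
        )
        # Hand the serialized config to the child over stdin, then close the
        # pipe so the child sees EOF.
        proc.stdin.write(self.task_config_proto.SerializeToString())
        proc.stdin.close()
        # Stream the child's output into the task log as it is produced.
        for line in iter(proc.stdout.readline, b""):
            logging.info(line.rstrip().decode("utf-8", errors="replace"))
        if proc.wait() != 0:
            raise AirflowException("subprocess exited with status %d" % proc.returncode)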

This approach has the advantage of working seamlessly with our build
system, and allowing us to redeploy DAGs even when dependencies in the
operator implementations change.

Any thoughts / comments / feedback? Have people faced similar issues out
there?

Many thanks,


-G Silk
