I can see how my first email was confusing, since I said: "Our first attempt
at productionizing Airflow used the vanilla DAGs folder, including all the
deps of all the DAGs with the airflow binary itself."
What I meant is that we do have a separate DAG deployment, but we are being
forced to package the *dependencies of the DAGs* with the Airflow binary,
because that's the only way to make the DAG definitions work. (Rough sketches
of the parsing problem and of the shell-out pattern we landed on are at the
bottom of this message.)

On Wed, Oct 31, 2018 at 11:18 PM, Gabriel Silk <gs...@dropbox.com> wrote:

> Our DAG deployment is already a separate deployment from Airflow itself.
>
> The issue is that the Airflow binary (whether acting as webserver,
> scheduler, or worker) is the one that *reads* the DAG files. So if you
> have, for example, a DAG that has this import statement in it:
>
> import mylib.foobar
>
> then the only way to successfully interpret this DAG definition in the
> Airflow process is to package the Airflow binary with the mylib.foobar
> dependency.
>
> This implies that every time you add a new dependency in one of your DAG
> definitions, you have to re-deploy Airflow itself, not just the DAG
> definitions.
>
>
> On Wed, Oct 31, 2018 at 2:45 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
>> Deploying the DAGs should be decoupled from deploying Airflow itself. You
>> can just use a resource that syncs the DAGs repo to the boxes on the
>> Airflow cluster periodically (say every minute). Resource orchestrators
>> like Chef, Ansible, or Puppet should have some easy way to do that. Either
>> that or some sort of mount or mount-equivalent (k8s has constructs for
>> that, EFS on Amazon).
>>
>> Also note that the DagFetcher abstraction that's been discussed before on
>> the mailing list would solve this and more.
>>
>> Max
>>
>> On Wed, Oct 31, 2018 at 2:37 PM Gabriel Silk <gs...@dropbox.com.invalid>
>> wrote:
>>
>> > Hello Airflow community,
>> >
>> > I'm currently putting Airflow into production at my company of 2000+
>> > people. The most significant sticking point so far is the deployment /
>> > execution model. I wanted to write up my experience in this matter and
>> > see how other people are dealing with this issue.
>> >
>> > First of all, our goal is to allow engineers to author DAGs and easily
>> > deploy them. That means they should be able to make changes to their
>> > DAGs, add/remove dependencies, and not have to redeploy any of the core
>> > components (scheduler, webserver, workers).
>> >
>> > Our first attempt at productionizing Airflow used the vanilla DAGs
>> > folder, including all the deps of all the DAGs with the airflow binary
>> > itself. Unfortunately, that meant we had to redeploy the scheduler,
>> > webserver and/or workers every time a dependency changed, which will
>> > definitely not work for us long term.
>> >
>> > The next option we considered was the "packaged DAGs" approach, whereby
>> > you place dependencies in a zip file. This would not work for us, due
>> > to the lack of support for dynamic libraries (see
>> > https://airflow.apache.org/concepts.html#packaged-dags).
>> >
>> > We have finally arrived at an option that seems reasonable, which is to
>> > use a single Operator that shells out to various binary targets that we
>> > build independently of Airflow, and which include their own
>> > dependencies. Configuration is serialized via protobuf and passed over
>> > stdin to the subprocess. The parent process (which is in Airflow's
>> > memory space) streams the logs from stdout and stderr.
>> >
>> > This approach has the advantage of working seamlessly with our build
>> > system, and of allowing us to redeploy DAGs even when dependencies in
>> > the operator implementations change.
>> >
>> > Any thoughts / comments / feedback? Have people faced similar issues
>> > out there?
>> >
>> > Many thanks,
>> >
>> > -G Silk
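
To make the parsing problem in the quoted thread concrete, here is a
minimal, hypothetical DAG along the lines of what our engineers write.
The module mylib.foobar and its generate_report function are placeholders
for any internal library, not real code:

    # dags/report_pipeline.py -- hypothetical example
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # This import runs every time the scheduler, webserver, or worker
    # parses this file, so mylib.foobar must be importable inside the
    # Airflow process itself, not just when the task runs.
    import mylib.foobar

    def run_report(**context):
        return mylib.foobar.generate_report(context["ds"])

    dag = DAG(
        "report_pipeline",
        start_date=datetime(2018, 10, 1),
        schedule_interval="@daily",
    )

    report = PythonOperator(
        task_id="generate_report",
        python_callable=run_report,
        provide_context=True,
        dag=dag,
    )

If mylib.foobar is not packaged with Airflow, the file fails at import
time in every Airflow process and the DAG never even shows up in the UI.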
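And here is a rough sketch of the shell-out operator pattern described at
the end of the quoted thread. It is simplified from what we actually run --
the TaskConfig proto, the mylib.protos module, and the binary path are
placeholders -- but it shows the shape of it: serialize the config, write
it to the subprocess's stdin, and stream the subprocess's output into the
task log:

    # shell_out_operator.py -- simplified sketch; names are placeholders
    import subprocess

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    # Hypothetical generated protobuf module for the task configuration.
    from mylib.protos import task_config_pb2


    class ShellOutOperator(BaseOperator):
        """Run a self-contained binary target, passing config over stdin."""

        @apply_defaults
        def __init__(self, binary_path, config_kwargs, *args, **kwargs):
            super(ShellOutOperator, self).__init__(*args, **kwargs)
            self.binary_path = binary_path
            self.config_kwargs = config_kwargs

        def execute(self, context):
            config = task_config_pb2.TaskConfig(**self.config_kwargs)

            proc = subprocess.Popen(
                [self.binary_path],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # merge stderr into stdout here
            )

            # Hand the serialized config to the subprocess and close stdin
            # so it sees EOF. (Assumes the config is small enough not to
            # fill the pipe buffer.)
            proc.stdin.write(config.SerializeToString())
            proc.stdin.close()

            # Stream the subprocess's output into the Airflow task log.
            for line in iter(proc.stdout.readline, b""):
                self.log.info(line.rstrip().decode("utf-8", "replace"))

            returncode = proc.wait()
            if returncode != 0:
                raise RuntimeError("%s exited with code %d"
                                   % (self.binary_path, returncode))

The binary targets bundle their own dependencies, so adding or changing a
dep only means rebuilding and redeploying that target -- the Airflow
scheduler, webserver, and workers never need to be touched.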