I'm interested in this myself, in terms of being able to update DAGs on Docker instances. I have done some work on this using a volume container that holds the DAGs. On all the different types of instances (webservers, schedulers, workers) the DAGs are mounted at /usr/local/airflow/dags.
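In compose terms it looks roughly like this; a sketch only, and the image names, tags and Airflow version are placeholders rather than our exact setup:

version: '2'
services:
  # data-only container whose sole job is to hold the DAG files
  dags:
    image: mycompany/airflow-dags:1.0.3     # placeholder image/tag
    volumes:
      - /usr/local/airflow/dags
    command: "true"                         # exits immediately; only the volume matters
  webserver:
    image: mycompany/airflow:1.7.1          # placeholder
    command: airflow webserver
    volumes_from:
      - dags
  scheduler:
    image: mycompany/airflow:1.7.1
    command: airflow scheduler
    volumes_from:
      - dags
  worker:
    image: mycompany/airflow:1.7.1
    command: airflow worker
    volumes_from:
      - dags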
I'm using Rancher, and with a docker-compose.yml I was able to create a sidekick container. This ensures that wherever the other components are deployed, there is also a copy of the DAG container (a rough sketch of the compose wiring is at the bottom of this message, below the quoted thread). The only issue is that whenever the sidekick is updated, the other containers in the service are updated along with it; I haven't found a way to work around this yet. What's nice about this approach is that the DAG container is versioned in Docker with tags.

I also considered using Convoy with Elastic Storage. It's a little clunkier copying things to external storage, and you don't get versioning as easily.

--Danny

On Wed, Sep 14, 2016 at 3:03 PM Vijay Bhat <[email protected]> wrote:

> Hi Max,
>
> That's very helpful. Looking forward to the version semantics features in Airflow. Until then I will use Chef (or alternatives).
>
> Thanks,
> Vijay
>
> On Thu, Sep 8, 2016 at 8:39 AM, Maxime Beauchemin <[email protected]> wrote:
>
> > Hi Vijay,
> >
> > Up until recently we had assumed that people already had their own way of syncing GH repos on their infrastructure. In our case at Airbnb it's Chef; pretty much every company has its own way of doing this, and it's a requirement for distributed Airflow.
> >
> > A related item on our roadmap is to allow for adding version semantics (git SHAs) in the communication layer, so that workers would fetch shallow clones of the DAG repository as of a specific version. We were debating using some form of serialization versus this approach, and decided to fully embrace configuration as code and shy away from serialization / artifact management, which brings many challenges and limitations, especially in Python.
> >
> > As we roll this change out, Airflow won't rely on external services to sync up repos, and we'll have a solid story around versioning. Of course that implies that Git becomes a critical hotspot in the cluster. We're planning to ship this feature as opt-in, at least until 2.0.
> >
> > To the community: we'll share a formal design doc in the near future; in the meantime this thread can be a good place for discussing this solution at a high level.
> >
> > Thanks,
> >
> > Max
> >
> > On Wed, Sep 7, 2016 at 3:25 PM, Vijay Bhat <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > First off, I want to thank the Airflow community for developing a fantastic data pipelining platform. I used Dataswarm extensively while I was at Facebook, and it's awesome to see most of the functionality available for the rest of the world to use in the form of Airflow.
> > >
> > > What I haven't found in the documentation is a prescribed way to connect the source control repo for the DAG code to the Airflow DAG folder, to make sure the latest code changes are picked up by the scheduler. In the Airflow forums, I have seen people mention using cron / Chef / Puppet etc., but no git webhook (https://developer.github.com/v3/repos/hooks/) based methods.
> > >
> > > Using webhooks would prevent the need to poll the repo for changes. For example, Jenkins uses webhooks to auto-trigger builds - https://wiki.jenkins-ci.org/display/JENKINS/Github+Plugin#GithubPlugin-TriggerabuildwhenachangeispushedtoGitHub. Does Airflow have a way of configuring something similar?
> > >
> > > Thanks!
> > > Vijay
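For reference, here is roughly what the sidekick wiring I mentioned above looks like in the Rancher flavour of docker-compose.yml. Again just a sketch, with placeholder image names and tags:

# Rancher co-schedules the dags sidekick with every worker container,
# so the worker can always mount the DAG volume from it.
worker:
  image: mycompany/airflow:1.7.1            # placeholder
  command: airflow worker
  labels:
    io.rancher.sidekicks: dags              # declares dags as a sidekick of this service
  volumes_from:
    - dags
dags:
  image: mycompany/airflow-dags:1.0.3       # bump the tag to roll out new DAGs
  volumes:
    - /usr/local/airflow/dags

Upgrading the dags image through Rancher is what restarts the primary containers as well, which is the drawback I mentioned above.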
