I'm interested in this myself, in terms of being able to update DAGs on
Docker instances.  I have done some work on this using a volume:
basically, a volume container that holds the DAGs.  On all the different
types of instances (webservers, schedulers, workers), the DAGs are mounted
at /usr/local/airflow/dags.
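
Just to sketch the shared-mount idea (illustrative only; this uses the
Docker SDK for Python rather than Rancher, and the image name is a
placeholder):

    import docker

    client = docker.from_env()
    client.volumes.create(name="airflow-dags")  # the shared DAG volume

    for role in ["webserver", "scheduler", "worker"]:
        client.containers.run(
            "puckel/docker-airflow",          # placeholder Airflow image
            command=role,
            name="airflow-" + role,
            volumes={"airflow-dags": {"bind": "/usr/local/airflow/dags",
                                      "mode": "rw"}},
            detach=True,
        )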

I'm using Rancher, and with a docker-compose.yml I was able to create a
sidekick container.  This ensures that wherever the other components are
deployed, there is also a copy of the DAG container.  The only issue is
that whenever the sidekick is updated, the other containers are updated as
well; I haven't found a way around this yet.  What's nice about this
approach is that the DAG container is versioned in Docker with tags.

I also considered using Convoy with Elastic Storage.  It's a little more
clunky to copy things out to external storage, and it doesn't offer the
benefit of versioning as easily.

--Danny


On Wed, Sep 14, 2016 at 3:03 PM Vijay Bhat <[email protected]> wrote:

> Hi Max,
>
> That's very helpful. Looking forward to the version semantics features in
> Airflow. Until then, I will use Chef (or alternatives).
>
> Thanks,
> Vijay
>
> On Thu, Sep 8, 2016 at 8:39 AM, Maxime Beauchemin <
> [email protected]> wrote:
>
> > Hi Vijay,
> >
> > Up until recently, we assumed that people already had their own way of
> > syncing GitHub repos on their infrastructure. In our case at Airbnb it's
> > Chef; pretty much every company has its own way of doing this, and it is
> > a requirement for distributed Airflow.
> >
> > A related item on our roadmap is to add version semantics (git SHAs) in
> > the communication layer, so that workers would fetch shallow clones of
> > the DAG repository as of a specific version. We debated using some form
> > of serialization versus this approach, and decided to fully embrace
> > configuration as code and shy away from serialization / artifact
> > management, which brings many challenges and limitations, especially in
> > Python.
> >
> > As we roll this change out, Airflow won't rely on external services to
> > sync up repos, and we'll have a solid story around versioning. Of course,
> > that implies Git becomes a critical hotspot in the cluster. We're
> > planning to ship this feature as opt-in, at least until 2.0.
> >
> > To the community: we'll share a formal design doc in the near future. In
> > the meantime, this thread can be a good place for discussing this
> > solution at a high level.
> >
> > Thanks,
> >
> > Max
> >
> > On Wed, Sep 7, 2016 at 3:25 PM, Vijay Bhat <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > First off, I want to thank the Airflow community for developing a
> > > fantastic data pipelining platform. I used Dataswarm extensively while
> > > I was at Facebook, and it's awesome to see most of the functionality
> > > available for the rest of the world to use in the form of Airflow.
> > >
> > > What I haven't found in the documentation is a prescribed way to
> > > connect the source control repo for the DAG code to the Airflow DAG
> > > folder, to make sure the latest code changes are picked up by the
> > > scheduler. In the Airflow forums, I have seen people mention using
> > > cron / Chef / Puppet etc., but no git webhook
> > > (https://developer.github.com/v3/repos/hooks/) based methods.
> > >
> > > Using webhooks would avoid the need to poll the repo for changes. For
> > > example, Jenkins uses webhooks to auto-trigger builds:
> > > https://wiki.jenkins-ci.org/display/JENKINS/Github+Plugin#GithubPlugin-TriggerabuildwhenachangeispushedtoGitHub
> > > Does Airflow have a way of configuring something similar?
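> > >
> > > To sketch what I mean (purely hypothetical, using Flask; the endpoint,
> > > repo path, and pull command are placeholders, not anything Airflow
> > > provides today):
> > >
> > >     import subprocess
> > >     from flask import Flask, request
> > >
> > >     app = Flask(__name__)
> > >     DAGS_REPO = "/usr/local/airflow/dags"  # local clone of the DAG repo
> > >
> > >     @app.route("/github-webhook", methods=["POST"])
> > >     def on_push():
> > >         # GitHub sets X-GitHub-Event on every delivery; act only on pushes.
> > >         if request.headers.get("X-GitHub-Event") == "push":
> > >             # Pull the latest DAG code; the scheduler picks it up on
> > >             # its next DAG folder scan.
> > >             subprocess.check_call(["git", "-C", DAGS_REPO, "pull", "--ff-only"])
> > >         return "", 204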
> > >
> > > Thanks!
> > > Vijay
> > >
> >
>
