Thanks for the reply, Joy. Let me walk you through things as they are today:
1) we don't stop Airflow or disable DAGs while deploying updates to logic;
the deploy happens live once a change is released
2) the Python script in the DAG folder doesn't actually contain DAGs; it is
a shim layer that lets us deploy atomically on a single host
2.1) this script reads a file on local disk (smaller than a disk page, so
the read is atomic) to find the latest git commit deployed
2.2) it re-does the Airflow DAG load process, but pointing at that commit
Example directory structure
/latest # pointer to latest commit
This is how we make sure deploys are consistent within a single task.
Now, let's assume we have a fully atomic commit process and are able to
upgrade DAGs at the exact same moment.
At time T0 the scheduler knows of DAG V1 and schedules two tasks, Task1
and Task2.
At time T1, Task1 is picked up by Worker1, which starts executing the task
(V1 code).
At time T2 the deploy commit happens; current DAG version: V2.
At time T3, Task2 is picked up by Worker2, which starts executing the task
(V2 code).
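The race above can be sketched as a toy model (not Airflow code): each worker
resolves "latest" at pickup time, so one dag_run can execute a mix of versions.

```python
# Toy model: the deploy pointer is read when a task *starts*, not when the
# dag_run was scheduled, so Task1 and Task2 can run different versions.

deployed = {"latest": "V1"}

def pick_up_task(task_name):
    # Worker resolves the pointer at pickup time.
    return (task_name, deployed["latest"])

run = []
run.append(pick_up_task("Task1"))   # T1: Worker1 starts Task1 under V1
deployed["latest"] = "V2"           # T2: deploy lands
run.append(pick_up_task("Task2"))   # T3: Worker2 starts Task2 under V2

versions = {v for _, v in run}      # two distinct versions in one dag_run
```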
In many cases this isn't really a problem (e.g. a tuning config change to a
Hadoop job), but as more people use Airflow we are spending a lot of time
debugging why production acted differently than expected (the problem was
already fixed... why is it still here?). We also see that some tasks expect
a given behavior from other tasks; since they live in the same git repo,
authors can modify both tasks at the same time when a breaking change is
needed, but when this rolls out to prod there is no way to do that safely
other than turning off the DAG and logging in to all hosts to verify the
deploy fully rolled out.
We would like to remove this confusion by exposing generations/versions
(same thing, really) to users and making sure that for a single dag_run only
one version is used.
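One way to sketch the per-dag_run pinning we're after (the class and method
names here are hypothetical, not an existing Airflow API):

```python
# Hypothetical sketch: record the deployed commit when a dag_run is created,
# so every task in that run loads the same version regardless of deploys
# that land mid-run.

class DagRunVersionPin:
    """Scheduler records the commit at schedule time; workers honor it."""

    def __init__(self):
        self._pins = {}  # (dag_id, run_id) -> commit

    def on_run_created(self, dag_id, run_id, current_commit):
        # Scheduler side: capture the commit at T0, when the run is created.
        self._pins[(dag_id, run_id)] = current_commit

    def commit_for(self, dag_id, run_id, fallback):
        # Worker side: load the pinned commit, not whatever "latest" says now.
        return self._pins.get((dag_id, run_id), fallback)
```

With this, the scheduler pins V1 at T0; even after the V2 deploy at T2,
Worker2 asks for the run's commit and still gets V1.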
I hope this is more clear.
On Fri, Feb 23, 2018 at 1:37 PM, Joy Gao <j...@wepay.com> wrote:
> Hi David,
> Do you mind providing a concrete example of the scenario in which
> scheduler/workers see different states (I'm not 100% sure if I understood
> the issue at hand).
> And by same dag generation, are you referring to the dag version? (DAG
> version is currently not supported at all, but I can see it being a
> building block for future use cases).
> On Fri, Feb 23, 2018 at 1:00 PM, David Capwell <dcapw...@gmail.com> wrote:
> > My current thinking is to add a field to the dag table that is optionally
> > provided by the dag. We currently intercept the load path, so we could use
> > this field to make sure we load the same generation. My concern here is
> > the interaction with the scheduler; I'm not as familiar with that logic,
> > so I don't know the corner cases where this would fail.
> > Any other recommendations for how this could be done?
> > On Mon, Feb 19, 2018, 10:33 PM David Capwell <dcapw...@gmail.com> wrote:
> > > We have been using Airflow for logic that delegates to other systems, so
> > > we inject a task that all tasks depend on to make sure all resources used
> > > are the same for all tasks in the dag. This works well for tasks that
> > > delegate to external systems, but people are starting to need to run
> > > logic in Airflow itself, and the fact that the scheduler and all workers
> > > can see different states is causing issues
> > >
> > > We can make sure that all the code is deployed in a consistent way, but
> > > we need help from the scheduler to tell the workers the current
> > > generation for a DAG.
> > >
> > > My question is: what would be the best way to modify Airflow to allow
> > > DAGs to define a generation value that the scheduler could send to
> > > workers?
> > >
> > > Thanks
> > >