Notes related to the proposal here:
https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)

* All of this seems very sound to me. Moving the methods to the right
places will bring a lot of clarity. It's clear I'm no longer alone in
understanding the current challenges and potential solutions. This is
awesome!
* DagRun.run_id's purpose is to allow people to define something meaningful
at the grain of their ETL. Say you wait on a genome file in a folder and
want a DagRun for each genome file: you can put your unique filename in
that run_id and refer to it in your templates/code. It's more of a way for
people to express and use their own "run id" that is meaningful to them and
carry it through inside Airflow. Airflow's internals would always use
dag_id and execution_date as the key, regardless of run_id.
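The keying scheme above can be sketched in plain Python (hypothetical
in-memory store, not Airflow's actual SQLAlchemy models): run_id stays a
user-facing label, while lookups always go through (dag_id, execution_date):

```python
from datetime import datetime

class DagRunStore:
    """Hypothetical store illustrating the proposed keying, not real Airflow code."""

    def __init__(self):
        # The internal key is always (dag_id, execution_date), never run_id.
        self._runs = {}

    def create(self, dag_id, execution_date, run_id):
        self._runs[(dag_id, execution_date)] = {"run_id": run_id, "state": "running"}

    def get(self, dag_id, execution_date):
        return self._runs[(dag_id, execution_date)]

store = DagRunStore()
# A user waiting on genome files can name each run after the file it processes...
store.create("genome_etl", datetime(2015, 6, 1), run_id="genome_GRCh38.fa")
# ...and that run_id rides along for templates/code, but internal lookups
# never depend on it.
run = store.get("genome_etl", datetime(2015, 6, 1))
print(run["run_id"])
```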

* What goes in DagRun.lock_id? The job_id of the process managing it? What
if it needs to be restarted? We could also just have DagRun.type, where
type is either 'backfill' or 'scheduler'. Having backfill overwrite
scheduler jobs might mean that backfill appropriates the DagRuns that are
not in a running state. Lots of complexity and edge cases in this area...
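The "backfill appropriates non-running DagRuns" idea could look something
like this (a sketch under the assumed DagRun.type schema from the bullet
above; field names are hypothetical):

```python
# Hypothetical DagRun records carrying the proposed 'type' field.
runs = [
    {"execution_date": "2015-06-01", "type": "scheduler", "state": "failed"},
    {"execution_date": "2015-06-02", "type": "scheduler", "state": "running"},
    {"execution_date": "2015-06-03", "type": "scheduler", "state": "success"},
]

def appropriate_for_backfill(runs):
    """Backfill takes over any DagRun that is not currently running."""
    taken = []
    for run in runs:
        if run["state"] != "running":
            run["type"] = "backfill"
            taken.append(run["execution_date"])
    return taken

# The 2015-06-02 run is left alone because it is still running.
print(appropriate_for_backfill(runs))
```

Even in this toy version, the edge cases show up quickly: what happens to
the running run when the backfill reaches its date, and who wins if the
scheduler restarts mid-backfill.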
* One constraint around backfill (until we get the git time machine up) is
to allow users to run local code with no handoff to the scheduler, so that
you can check out any version of your DAG in your local repo and run the
DAG as defined locally.
* I'm unclear on whether DagRunJob is sync or async; the scheduler needs it
to be async, I think, while backfill overall should be synchronous and log
progress.
* Some of the design might need to change to accommodate the subprocess
handling I just described in the Google group (
https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg) that
Paul is working on: essentially, the scheduling needs to take place in a
subprocess and should be async. For backfill that's not a constraint; it
could take place in the main process and can be synchronous...

All of this is fairly brutal and should be broken down into many small PRs
(3? 5?). There are many other large pieces in motion (distributing the
scheduler and parsing the DagBag in subprocesses, the git time machine,
docker/containment, ...). We should land the pieces that help everything
else fall into place, and be very careful about changes that make other
pieces of the puzzle harder to fit in.

Max