> There are many other large pieces in movement (distributing the scheduler and parsing DagBag in subprocesses, the git time machine, docker/containment, ...).
Maxime, can you please get the work you're doing documented somewhere public?

On Wed, Apr 27, 2016 at 4:03 PM, Maxime Beauchemin <[email protected]> wrote:

> Notes related to the proposal here:
> https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)
>
> * All of this seems very sound to me. Moving the methods to the right
> places will bring a lot of clarity. I clearly see that I'm not alone in
> understanding the current challenges and potential solutions anymore! This
> is awesome!
> * DagRun.run_id's purpose is to allow people to define something meaningful
> to the grain of their ETL. Say if you wait on a genome file in a folder and
> want a DagRun for each genome file, you can put your unique filename as
> that run_id and refer to it in your templates/code. It's more of a way for
> people to express and use their own "run id" that is meaningful to them and
> carry it through inside Airflow. Airflow's internals would always use
> dag_id and execution_date internally as the key regardless of run_id.
>
> * what goes in DagRun.lock_id? the job_id of the process managing it? What
> if it needs to be restarted? We could also just have DagRun.type where type
> is either 'backfill' or 'scheduler'. backfilling to overwrite scheduler job
> may mean that backfill would appropriate for itself the DagRuns that are not
> in a running state. Lots of complexity and edge cases in this area...
> * One constraint around backfill (until we get the git time-machine up) is
> to allow users to run local code with no handoff to the scheduler, so that
> you can go to any version of your DAG in your local repo and run the DAG as
> defined locally
> * I'm unclear on DagRunJob being sync or async, the scheduler needs it to
> be async I think, backfill overall should be synchronous and log progress
> * Some of the design might need to change to accommodate the subprocess
> handling I just described in the Google group (
> https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg) that
> Paul is working on, but essentially the scheduling needs to take place in a
> subprocess and should be async. For backfill it's not a constraint. It could
> take place in the main process and can be synchronous...
>
> All of this is fairly brutal and should be broken down in many small PRs
> (3? 5?). There are many other large pieces in movement (distributing the
> scheduler and parsing DagBag in subprocesses, the git time machine,
> docker/containment, ...). We should land the pieces that help everything
> else fall into place, and be very careful of changes that make other pieces
> of the puzzle harder to fit in.
>
> Max
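
For reference, the run_id point in the quoted notes (a user-chosen label carried alongside an internal key of dag_id and execution_date) could be sketched roughly as below. This is only an illustration under assumptions: DagRunSketch, its key property, and register() are hypothetical names made up here, not Airflow's actual model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Tuple

RunKey = Tuple[str, datetime]


# Hypothetical stand-in for a DagRun record; not Airflow's real class.
@dataclass
class DagRunSketch:
    dag_id: str
    execution_date: datetime
    run_id: Optional[str] = None  # user-chosen label, e.g. a unique filename

    @property
    def key(self) -> RunKey:
        # Identity is (dag_id, execution_date); run_id is deliberately ignored.
        return (self.dag_id, self.execution_date)


def register(runs: Dict[RunKey, DagRunSketch], run: DagRunSketch) -> None:
    """Store a run under its internal key, rejecting duplicates of that key."""
    if run.key in runs:
        raise ValueError(f"duplicate run for {run.key}")
    runs[run.key] = run


runs: Dict[RunKey, DagRunSketch] = {}
first = DagRunSketch("genome_etl", datetime(2016, 4, 27), run_id="sample_001.fa")
register(runs, first)
```

In this sketch a second run with the same dag_id and execution_date is rejected even if its run_id differs, which is one possible reading of "regardless of run_id"; the label is only there for templates and user code to refer to.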
