> There are many other large pieces in movement (distributing the scheduler
and parsing DagBag in subprocesses, the git time machine, docker/containment,
...).

Maxime, can you please get the work you're doing documented somewhere
public?

On Wed, Apr 27, 2016 at 4:03 PM, Maxime Beauchemin <
[email protected]> wrote:

> Notes related to the proposal here:
> https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)
>
> * All of this seems very sound to me. Moving the methods to the right
> places will bring a lot of clarity. I clearly see that I'm not alone in
> understanding the current challenges and potential solutions anymore! This
> is awesome!
> * DagRun.run_id's purpose is to allow people to define something meaningful
> to the grain of their ETL. Say if you wait on a genome file in a folder and
> want a DagRun for each genome file, you can put your unique filename as
> that run_id and refer to it in your templates/code. It's more of a way for
> people to express and use their own "run id" that is meaningful to them and
> carry it through inside Airflow. Airflow's internals would always use
> dag_id and execution_date as the key regardless of run_id.
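
To make sure I read the run_id point correctly, here is a minimal sketch of
the idea as I understand it. The DagRun class below is a hypothetical
stand-in, not Airflow's actual model, and the DAG name, file name and
register() helper are made up for illustration. The only thing taken from
the notes is that (dag_id, execution_date) is the internal key and run_id is
a free-form user label.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Tuple


# Hypothetical stand-in for the DagRun model; not Airflow's actual class.
@dataclass
class DagRun:
    dag_id: str
    execution_date: datetime
    run_id: Optional[str] = None  # user-chosen label, e.g. a file name

    @property
    def key(self) -> Tuple[str, datetime]:
        # Internal identity is built from dag_id and execution_date only.
        return (self.dag_id, self.execution_date)


# Registry indexed by the internal key, never by run_id.
runs: Dict[Tuple[str, datetime], DagRun] = {}


def register(run: DagRun) -> None:
    runs[run.key] = run


# Assumed example values: one DagRun per genome file, labelled by file name.
run = DagRun("genome_etl", datetime(2016, 4, 27), run_id="sample_0042.fasta")
register(run)
```

Templates and user code could read run.run_id, while lookups inside the
scheduler would go through run.key.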
>
> * What goes in DagRun.lock_id? The job_id of the process managing it? What
> if it needs to be restarted? We could also just have DagRun.type where type
> is either 'backfill' or 'scheduler'. Letting a backfill override the
> scheduler job may mean that the backfill would appropriate for itself the
> DagRuns that are not in a running state. Lots of complexity and edge cases
> in this area...
> * One constraint around backfill (until we get the git time-machine up) is
> to allow users to run local code with no handoff to the scheduler, so that
> you can go to any version of your DAG in your local repo and run the DAG as
> defined locally.
> * I'm unclear on DagRunJob being sync or async. The scheduler needs it to
> be async, I think; backfill overall should be synchronous and log progress.
> * Some of the design might need to change to accommodate the subprocess
> handling I just described in the Google group (
> https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg) that
> Paul is working on, but essentially the scheduling needs to take place in a
> subprocess and should be async. For backfill it's not a constraint. It could
> take place in the main process and can be synchronous...
>
> All of this is fairly brutal and should be broken down in many small PRs
> (3? 5?). There are many other large pieces in movement (distributing the
> scheduler and parsing DagBag in subprocesses, the git time machine,
> docker/containment, ...). We should land the pieces that help everything
> else fall into place, and be very careful of changes that make other pieces
> of the puzzle harder to fit in.
>
> Max
>
