Dan has a work-in-progress PR out here around refactoring the dependency
engine:
https://github.com/airbnb/airflow/pull/1435

Paul, can you share the work you're doing on the scheduler, or your plans?
The idea there is to parse DAGs only in short-lived subprocesses.
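To make the idea concrete, here is a minimal sketch of parsing a DAG file in a short-lived subprocess. This is purely illustrative, not Airflow's actual implementation: the `parse_in_subprocess` helper is hypothetical, and the child process only simulates parsing rather than importing a real DAG module.

```python
import json
import subprocess
import sys

# The child process would, in a real scheduler, import the DAG file and
# collect DAG objects; here it just simulates that work and reports back.
CHILD_CODE = """
import json, sys
path = sys.argv[1]
print(json.dumps({"path": path, "dag_ids": ["example_dag"]}))
"""


def parse_in_subprocess(path, timeout=30):
    """Parse one DAG file in a short-lived child process.

    A slow or broken DAG file then cannot wedge the main scheduler loop:
    it either returns within the timeout or gets skipped.
    """
    try:
        out = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, path],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None  # hang or crash during parsing: skip this file
    return json.loads(out.stdout)
```

The main process only ever handles the parsed metadata that comes back over the pipe, so a pathological DAG file can at worst cost one timeout.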

As for the "git time machine" I believe Paul has a wiki page we're getting
ready to share. Dan has worked on git sync at scale for CI workloads at
Twitter, so that brings extra confidence in this approach.

About Docker/containment, it's pretty much just conversations so far. We're
struggling with how to get some of our Chef recipe assets, like service
discovery, inside Docker containers. Juggling containers in a Chef world is
pretty foreign to all of us.

Most pieces aren't exactly in movement, but we know big things are going to
move soon.

We should share our roadmap and sprints systematically, I'll talk to our PM
about making this part of the process.

Max

On Wed, Apr 27, 2016 at 10:22 PM, Chris Riccomini <[email protected]>
wrote:

> > There are many other large pieces in movement (distributing the
> > scheduler and parsing DagBag in subprocesses, the git time machine,
> > docker/containment, ...).
>
> Maxime, can you please get the work you're doing documented somewhere
> public?
>
> On Wed, Apr 27, 2016 at 4:03 PM, Maxime Beauchemin <
> [email protected]> wrote:
>
> > Notes related to the proposal here:
> > https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)
> >
> > * All of this seems very sound to me. Moving the methods to the right
> > places will bring a lot of clarity. I can clearly see that I'm no longer
> > alone in understanding the current challenges and potential solutions!
> > This is awesome!
> > * DagRun.run_id's purpose is to allow people to define something
> > meaningful to the grain of their ETL. Say you wait on a genome file in a
> > folder and want a DagRun for each genome file: you can put your unique
> > filename as that run_id and refer to it in your templates/code. It's
> > more of a way for people to express and use their own "run id" that is
> > meaningful to them and carry it through inside Airflow. Airflow's
> > internals would always use dag_id and execution_date as the key,
> > regardless of run_id.
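[Editor's sketch of the keying scheme described above, in plain Python rather than actual Airflow code; the in-memory registry and `create_dag_run` helper are hypothetical. Internally, runs are keyed by (dag_id, execution_date), while run_id is a user-supplied label carried alongside.]

```python
from datetime import datetime

# Hypothetical in-memory registry: the internal primary key is always
# (dag_id, execution_date); run_id is user-defined metadata (e.g. a genome
# filename) that templates/code can refer to.
runs = {}


def create_dag_run(dag_id, execution_date, run_id=None):
    key = (dag_id, execution_date)  # internal key, regardless of run_id
    runs[key] = {
        "run_id": run_id or "scheduled__%s" % execution_date.isoformat(),
    }
    return key


key = create_dag_run("genome_pipeline", datetime(2016, 4, 27),
                     run_id="genome_HG00096.fa")
print(runs[key]["run_id"])  # the user-meaningful label
```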
> >
> > * What goes in DagRun.lock_id? The job_id of the process managing it?
> > What if it needs to be restarted? We could also just have DagRun.type,
> > where type is either 'backfill' or 'scheduler'. Backfill overriding a
> > scheduler job may mean that backfill appropriates for itself the DagRuns
> > that are not in a running state. Lots of complexity and edge cases in
> > this area...
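[Editor's sketch of the take-over behavior discussed above: a backfill job claiming DagRuns that are not currently running by setting lock_id to its own job_id. The field names mirror the DagRun.lock_id idea, but the claim logic is an assumption, not the actual proposal.]

```python
def claim_dag_runs(dag_runs, job_id):
    """Claim unowned, non-running DagRuns for the given job.

    In a real database this would be an atomic compare-and-set (UPDATE ...
    WHERE lock_id IS NULL AND state != 'running'); here dicts stand in.
    """
    claimed = []
    for run in dag_runs:
        if run["state"] != "running" and run.get("lock_id") is None:
            run["lock_id"] = job_id
            claimed.append(run)
    return claimed


dag_runs = [
    {"execution_date": "2016-04-25", "state": "success", "lock_id": None},
    {"execution_date": "2016-04-26", "state": "running", "lock_id": 42},
    {"execution_date": "2016-04-27", "state": "failed", "lock_id": None},
]
claimed = claim_dag_runs(dag_runs, job_id=99)
print([r["execution_date"] for r in claimed])  # only the non-running runs
```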
> > * One constraint around backfill (until we get the git time machine up)
> > is to allow users to run local code with no handoff to the scheduler, so
> > that you can go to any version of your DAG in your local repo and run
> > the DAG as defined locally.
> > * I'm unclear on whether DagRunJob is sync or async. The scheduler needs
> > it to be async, I think; backfill overall should be synchronous and log
> > progress.
> > * Some of the design might need to change to accommodate the subprocess
> > handling I just described in the Google group (
> > https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg) that
> > Paul is working on, but essentially the scheduling needs to take place
> > in a subprocess and should be async. For backfill it's not a constraint:
> > it could take place in the main process and can be synchronous...
> >
> > All of this is fairly brutal and should be broken down into many small
> > PRs (3? 5?). There are many other large pieces in movement (distributing
> > the scheduler and parsing the DagBag in subprocesses, the git time
> > machine, docker/containment, ...). We should land the pieces that help
> > everything else fall into place, and be very careful with changes that
> > make other pieces of the puzzle harder to fit in.
> >
> > Max
> >
>
