Please also open JIRAs for this stuff so people can see what feature work is going on without tracking the mailing list.
On Wed, Apr 27, 2016 at 11:10 PM, Chris Riccomini <[email protected]> wrote:
> Hey Maxime,
>
> Great, thanks.
>
> > We should share our roadmap and sprints systematically, I'll talk to
> > our PM about making this part of the process.
>
> Keep in mind that you guys will need to get feedback from the community.
> Deciding on how things are implemented (e.g. how DAGs are deployed in
> Airflow (is this what the git time machine is? I have concerns about
> using Git as a deployment mechanism, as you described with Data Swarm))
> has to be done collectively.
>
> Cheers,
> Chris
>
> On Wed, Apr 27, 2016 at 11:03 PM, Maxime Beauchemin
> <[email protected]> wrote:
>
>> Dan's got a work-in-progress PR out here around refactoring the
>> dependency engine:
>> https://github.com/airbnb/airflow/pull/1435
>>
>> Paul, can you share the work you're doing on the scheduler, or your
>> plans? The idea there is to parse DAGs only in short-lived subprocesses.
>>
>> As for the "git time machine", I believe Paul has a wiki page we're
>> getting ready to share. Dan has worked on git sync at scale for CI
>> workloads at Twitter, so that brings extra confidence in this approach.
>>
>> About Docker/containment, it's pretty much just conversations so far.
>> We're struggling with the idea of getting some of our Chef recipe
>> assets, like service discovery, inside Docker containers. Juggling
>> containers in a Chef world is pretty foreign to all of us.
>>
>> Most pieces aren't exactly in movement yet, but we know big things are
>> going to move soon.
>>
>> We should share our roadmap and sprints systematically; I'll talk to
>> our PM about making this part of the process.
>>
>> Max
>>
>> On Wed, Apr 27, 2016 at 10:22 PM, Chris Riccomini
>> <[email protected]> wrote:
>>
>> > > There are many other large pieces in movement (distributing the
>> > > scheduler and parsing the DagBag in subprocesses, the git time
>> > > machine, docker/containment, ...).
>> >
>> > Maxime, can you please get the work you're doing documented somewhere
>> > public?
>> >
>> > On Wed, Apr 27, 2016 at 4:03 PM, Maxime Beauchemin
>> > <[email protected]> wrote:
>> >
>> > > Notes related to the proposal here:
>> > > https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)
>> > >
>> > > * All of this seems very sound to me. Moving the methods to the
>> > > right places will bring a lot of clarity. I can clearly see that I'm
>> > > no longer alone in understanding the current challenges and
>> > > potential solutions! This is awesome!
>> > > * DagRun.run_id's purpose is to allow people to define something
>> > > meaningful at the grain of their ETL. Say you wait on a genome file
>> > > in a folder and want a DagRun for each genome file: you can put your
>> > > unique filename in that run_id and refer to it in your
>> > > templates/code. It's more a way for people to express and use their
>> > > own "run id" that is meaningful to them and carry it through inside
>> > > Airflow. Airflow's internals would always use dag_id and
>> > > execution_date as the key, regardless of run_id.
>> > > * What goes in DagRun.lock_id? The job_id of the process managing
>> > > it? What if it needs to be restarted? We could also just have
>> > > DagRun.type, where type is either 'backfill' or 'scheduler'.
>> > > Letting a backfill overwrite a scheduler job may mean that backfill
>> > > would appropriate the DagRuns that are not in a running state. Lots
>> > > of complexity and edge cases in this area...
>> > > * One constraint around backfill (until we get the git time machine
>> > > up) is to allow users to run local code with no handoff to the
>> > > scheduler, so that you can check out any version of your DAG in your
>> > > local repo and run the DAG as defined locally.
>> > > * I'm unclear on DagRunJob being sync or async. The scheduler needs
>> > > it to be async, I think; backfill overall should be synchronous and
>> > > log progress.
>> > > * Some of the design might need to change to accommodate the
>> > > subprocess handling I just described in the Google group
>> > > (https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg)
>> > > that Paul is working on, but essentially the scheduling needs to
>> > > take place in a subprocess and should be async. For backfill it's
>> > > not a constraint: it could take place in the main process and can be
>> > > synchronous...
>> > >
>> > > All of this is fairly brutal and should be broken down into many
>> > > small PRs (3? 5?). There are many other large pieces in movement
>> > > (distributing the scheduler and parsing the DagBag in subprocesses,
>> > > the git time machine, docker/containment, ...). We should land the
>> > > pieces that help everything else fall into place, and be very
>> > > careful of changes that make other pieces of the puzzle harder to
>> > > fit in.
>> > >
>> > > Max
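[Editor's note] The run_id idea discussed in the thread (a user-meaningful label, such as a genome filename, carried alongside the internal key) can be made concrete with a minimal Python sketch. This is hypothetical illustration code, not Airflow's actual DagRun model; the class and default run_id format are assumptions.

```python
# Hypothetical sketch (not actual Airflow code): a user-supplied run_id
# carries a meaningful external key (e.g. a genome filename), while
# internals still key runs on (dag_id, execution_date).
from datetime import datetime

class DagRun:
    def __init__(self, dag_id, execution_date, run_id=None):
        self.dag_id = dag_id
        self.execution_date = execution_date
        # run_id defaults to a scheduler-style id; users may override it
        # with something meaningful to them, like a filename.
        self.run_id = run_id or "scheduled__%s" % execution_date.isoformat()

    def key(self):
        # Internals always use (dag_id, execution_date), never run_id.
        return (self.dag_id, self.execution_date)

run = DagRun("genome_etl", datetime(2016, 4, 27), run_id="genome_GRCh38.fa")
print(run.key())   # ('genome_etl', datetime.datetime(2016, 4, 27, 0, 0))
print(run.run_id)  # genome_GRCh38.fa
```

The point of the sketch: the run_id is a label users can template against, but it never replaces (dag_id, execution_date) as the identity of a run.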
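[Editor's note] The DagRun.type option Max floats (a 'backfill'/'scheduler' column, where a backfill appropriates DagRuns not in a running state) can be sketched in a few lines. The semantics below are one reading of the proposal, not a settled design; the names are made up for illustration.

```python
# Hedged sketch of the DagRun.type idea: a backfill claims ("appropriates")
# every existing run that is not currently running, leaving running ones to
# whichever job holds them. Assumed semantics, not an implementation.
RUNNING, FAILED, SUCCESS = "running", "failed", "success"

class DagRunRecord:
    def __init__(self, run_type, state):
        self.type = run_type    # 'scheduler' or 'backfill'
        self.state = state

def backfill_claims(runs):
    """Reassign to 'backfill' every run that is not in a running state."""
    for run in runs:
        if run.state != RUNNING:
            run.type = "backfill"
    return runs

runs = [DagRunRecord("scheduler", RUNNING), DagRunRecord("scheduler", FAILED)]
backfill_claims(runs)
print([r.type for r in runs])  # ['scheduler', 'backfill']
```

Even in this toy form, the edge cases Max mentions are visible: what happens to a run claimed by a backfill that then dies, or to a scheduler that still believes it owns the run, is left unanswered.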
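[Editor's note] The "parse DAGs only in short-lived subprocesses" idea mentioned twice in the thread can be illustrated with a small self-contained sketch: hand each DAG file to a short-lived child Python process with a timeout, so a bad or slow file cannot wedge the long-running scheduler. This is an assumed design sketch, not Airflow's actual implementation; PARSE_SNIPPET only simulates parsing.

```python
# Illustrative sketch (assumed design, not Airflow's code): parse a DAG
# file in a short-lived child process and report what was found, so a
# broken file can't crash or hang the scheduler process itself.
import json
import subprocess
import sys

# The child just simulates parsing; a real harvester would exec the file
# and collect DAG objects. This snippet is a stand-in, not a real API.
PARSE_SNIPPET = """
import json, sys
path = sys.argv[1]
# pretend we parsed the file and found one dag_id
print(json.dumps({"file": path, "dag_ids": ["example_dag"]}))
"""

def parse_in_subprocess(path, timeout=5):
    try:
        out = subprocess.check_output(
            [sys.executable, "-c", PARSE_SNIPPET, path],
            timeout=timeout,  # bound how long one file may take to parse
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Child crashed or hung: scheduler survives, file yields no DAGs.
        return {"file": path, "dag_ids": []}
    return json.loads(out)

result = parse_in_subprocess("dags/example.py")
print(result["dag_ids"])  # ['example_dag']
```

The design choice worth noting is the isolation boundary: the parent only ever sees serialized results, so arbitrary user code in a DAG file runs and dies entirely inside the child.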
