Please also open JIRAs for this stuff so people can see what feature work
is going on without tracking the mailing list.

On Wed, Apr 27, 2016 at 11:10 PM, Chris Riccomini <[email protected]>
wrote:

> Hey Maxime,
>
> Great, thanks.
>
> > We should share our roadmap and sprints systematically, I'll talk to
> our PM about making this part of the process.
>
> Keep in mind that you guys will need to get feedback from the community.
> Deciding on how things are implemented (e.g. how DAGs are deployed in
> Airflow; is this what the git time machine is? I have concerns about using
> Git as a deployment mechanism, as you described with Data Swarm) has to be
> done collectively.
>
> Cheers,
> Chris
>
> On Wed, Apr 27, 2016 at 11:03 PM, Maxime Beauchemin <
> [email protected]> wrote:
>
>> Dan's got a work-in-progress PR out here around refactoring the dependency
>> engine:
>> https://github.com/airbnb/airflow/pull/1435
>>
>> Paul, can you share the work you're doing on the scheduler, or your plans?
>> The idea there is to parse DAGs only in short-lived subprocesses.
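The short-lived-subprocess idea mentioned above could be sketched roughly as follows. All names here (the `parse_dag_file` helper, the fake task list, the timeout value) are illustrative assumptions for the sketch, not Airflow's actual API; the point is only that a hung or crashing DAG file costs one disposable child process instead of wedging the long-running scheduler.

```python
# Sketch: parse each DAG file in a short-lived subprocess so a bad or slow
# DAG file cannot wedge the long-running scheduler process.
# All names are illustrative; this is not Airflow's actual API.
import multiprocessing

def parse_dag_file(path, queue):
    """Runs in the child process: 'parse' the file and report task ids."""
    try:
        # Stand-in for real DAG parsing (e.g. exec'ing the file and
        # collecting DAG objects); here we just pretend.
        tasks = ["task_a", "task_b"]  # hypothetical parse result
        queue.put((path, tasks, None))
    except Exception as exc:
        queue.put((path, None, str(exc)))

def parse_with_timeout(path, timeout=30):
    """Parent side: spawn a child, wait, and kill it if it hangs."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=parse_dag_file, args=(path, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # a hung DAG file only costs us one subprocess
        proc.join()
        return (path, None, "timed out")
    return queue.get()

if __name__ == "__main__":
    print(parse_with_timeout("dags/example.py"))
```

The scheduler loop would call something like `parse_with_timeout` per DAG file on each pass, so parse state never accumulates in the parent process.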
>>
>> As for the "git time machine" I believe Paul has a wiki page we're getting
>> ready to share. Dan has worked on git sync at scale for CI workloads at
>> Twitter, so that brings extra confidence in this approach.
>>
>> About docker/containment, it's pretty much just conversations so far.
>> We're struggling with the idea of getting some of our Chef recipe assets,
>> like service discovery, inside Docker containers. Juggling containers in a
>> Chef world is pretty foreign to all of us.
>>
>> Most pieces aren't exactly in motion yet, but we know big things are going
>> to move soon.
>>
>> We should share our roadmap and sprints systematically, I'll talk to our
>> PM
>> about making this part of the process.
>>
>> Max
>>
>> On Wed, Apr 27, 2016 at 10:22 PM, Chris Riccomini <[email protected]>
>> wrote:
>>
>> > > There are many other large pieces in motion (distributing the
>> > > scheduler and parsing the DagBag in subprocesses, the git time
>> > > machine, docker/containment, ...).
>> >
>> > Maxime, can you please get the work you're doing documented somewhere
>> > public?
>> >
>> > On Wed, Apr 27, 2016 at 4:03 PM, Maxime Beauchemin <
>> > [email protected]> wrote:
>> >
>> > > Notes related to the proposal here:
>> > >
>> > > https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)
>> > >
>> > > * All of this seems very sound to me. Moving the methods to the right
>> > > places will bring a lot of clarity. I can clearly see that I'm no
>> > > longer alone in understanding the current challenges and potential
>> > > solutions! This is awesome!
>> > > * DagRun.run_id's purpose is to let people define something meaningful
>> > > at the grain of their ETL. Say you wait on genome files landing in a
>> > > folder and want a DagRun for each genome file: you can use the unique
>> > > filename as the run_id and refer to it in your templates/code. It's a
>> > > way for people to express their own "run id" that is meaningful to
>> > > them and carry it through inside Airflow. Airflow would always use
>> > > dag_id and execution_date as the internal key, regardless of run_id.
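To make the run_id semantics above concrete, here is a minimal sketch. The `DagRunSketch` class and the genome filename are hypothetical illustrations, not Airflow's actual model: the user-chosen run_id rides along as a label, while the engine keys every run on (dag_id, execution_date).

```python
# Sketch of the run_id semantics: run_id is a user-chosen label (e.g. a
# genome filename), while the engine keys runs on (dag_id, execution_date).
# DagRunSketch and all values below are hypothetical, for illustration only.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DagRunSketch:
    dag_id: str
    execution_date: datetime
    run_id: str  # meaningful to the user, opaque to the engine

    @property
    def key(self):
        # The engine always identifies a run by this pair, never by run_id.
        return (self.dag_id, self.execution_date)

run = DagRunSketch(
    dag_id="genome_pipeline",
    execution_date=datetime(2016, 4, 27),
    run_id="genome_sample_0042.fasta",  # hypothetical filename
)
print(run.key)
```

Templates and user code could then read `run.run_id` back out, while scheduler internals never need to interpret it.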
>> > >
>> > > * What goes in DagRun.lock_id? The job_id of the process managing it?
>> > > What if it needs to be restarted? We could also just have DagRun.type,
>> > > where type is either 'backfill' or 'scheduler'. Backfill overwriting a
>> > > scheduler job may mean that backfill would appropriate the DagRuns
>> > > that are not in a running state. Lots of complexity and edge cases in
>> > > this area...
>> > > * One constraint around backfill (until we get the git time machine
>> > > up) is to allow users to run local code with no handoff to the
>> > > scheduler, so that you can go to any version of your DAG in your
>> > > local repo and run the DAG as defined locally
>> > > * I'm unclear on DagRunJob being sync or async. The scheduler needs
>> > > it to be async, I think, while backfill overall should be synchronous
>> > > and log progress
>> > > * Some of the design might need to change to accommodate the
>> > > subprocess handling I just described in the Google group (
>> > > https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg)
>> > > that Paul is working on, but essentially the scheduling needs to take
>> > > place in a subprocess and should be async. For backfill it's not a
>> > > constraint: it could take place in the main process and can be
>> > > synchronous...
>> > >
>> > > All of this is fairly brutal and should be broken down into many
>> > > small PRs (3? 5?). There are many other large pieces in motion
>> > > (distributing the scheduler and parsing the DagBag in subprocesses,
>> > > the git time machine, docker/containment, ...). We should land the
>> > > pieces that help everything else fall into place, and be very careful
>> > > about changes that make other pieces of the puzzle harder to fit in.
>> > >
>> > > Max
>> > >
>> >
>>
>
>
