I think it should be controllable at the moment you start the backfill.

IMHO the default behaviour should be to reserve very little concurrency
for backfills, so that you can run them without impacting regular runs -
say "max 10 backfill task instances" and "max 3 backfill runs".

But then you should also be able to handle the special case where you want
to prioritise the backfill, and in certain cases even starve the regular
runs because you REALLY need to backfill old data asap. In that case you
should be able to override the max for a specific backfill instance.
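
A rough sketch of the idea, just to make it concrete. All names here
(BackfillRequest, effective_limits, can_start_backfill_run) are made up
for illustration and are not Airflow API; the default numbers are simply
the ones suggested above, and the regular DAG limit is assumed to be
tracked separately and left untouched.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical conservative defaults (the numbers suggested above).
DEFAULT_MAX_BACKFILL_RUNS = 3
DEFAULT_MAX_BACKFILL_TASK_INSTANCES = 10


@dataclass
class BackfillRequest:
    """Hypothetical settings chosen at the moment a backfill is started."""

    dag_id: str
    # None means "use the conservative default".
    max_active_runs: Optional[int] = None
    max_active_task_instances: Optional[int] = None


def effective_limits(req: BackfillRequest) -> Tuple[int, int]:
    """Return (max runs, max task instances) for this backfill."""
    runs = (
        req.max_active_runs
        if req.max_active_runs is not None
        else DEFAULT_MAX_BACKFILL_RUNS
    )
    task_instances = (
        req.max_active_task_instances
        if req.max_active_task_instances is not None
        else DEFAULT_MAX_BACKFILL_TASK_INSTANCES
    )
    return runs, task_instances


def can_start_backfill_run(req: BackfillRequest, active_backfill_runs: int) -> bool:
    """Check only the backfill's own budget, not the DAG's regular one."""
    max_runs, _ = effective_limits(req)
    return active_backfill_runs < max_runs
```

With no override, BackfillRequest("my_dag") stays within 3 runs and 10 task
instances; passing max_active_runs=50 for one urgent backfill raises the
limit for that instance only.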

J.


On Thu, Oct 3, 2024 at 8:16 PM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> Just adding the [DISCUSS] prefix, which I forgot to add.
>
> On Thu, Oct 3, 2024 at 4:23 PM Daniel Standish <
> daniel.stand...@astronomer.io> wrote:
>
> > Ok so, I'm thinking through what makes sense re concurrency control in
> > backfill.
> >
> > It was referred to
> > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=311627729#AIP78Schedulermanagedbackfill-Otherideasunderconsideration>
> > in the AIP but I didn't define the behavior:
> >
> >> Other ideas under consideration
> >>
> >>    - Add extra concurrency control on dag run
> >>    - Apply max active dag runs separately for backfill
> >>    - Override any dag param in creating the backfill job and it’s only
> >>    applied in that scope
> >
> > As I have proceeded with implementation, here's what I went with:
> >
> > Each "backfill" gets its own concurrency control ("max_active_runs") that
> > is evaluated completely separately from the DAG-scoped max_active_runs.
> >
> > So if DAG max active runs is 2, and the backfill max active runs is 1,
> > then you can have a max of 3 concurrent runs.  Your non-backfill dag
> > runs cannot starve out the backfill ones, and backfill dag runs cannot
> > starve out the non-backfill ones.
> >
> > The other way to go is to say that DAG.max_active_runs is global.  This
> > does not feel quite right to me because it gets a bit murky.  E.g. what
> > happens if DAG.max is 10 and Backfill.max is 10?  Do you allow it?
> > What do you do to avoid starving out non-backfill runs?
> >
> > What do people think?  Relevant PR is here
> > <https://github.com/apache/airflow/pull/42686>.
> >
>
