> But regarding the scheduler pressure, I think I have a bit of a
> different observation and the deadlock problem which is a bit
> downplayed in the current proposal is - IMHO - crucial problem to be
> solved when we want to make backfilling more accessible
I think we're the unfortunate victims of overloaded terminology. The deadlock I refer to in the AIP is not a database deadlock. The word choice is not ideal, but that's what it's currently called in the code. It refers to the scenario where no tasks can be scheduled anymore -- not due to database deadlock, but because the backfill process somehow got into a state where no tasks can be scheduled. For example, see this comment <https://github.com/apache/airflow/blob/c09fcdf1d0e69497cf1b628df9ba3349eb688256/airflow/jobs/backfill_job_runner.py#L496-L499> and this comment <https://github.com/apache/airflow/blob/c09fcdf1d0e69497cf1b628df9ba3349eb688256/airflow/jobs/backfill_job_runner.py#L730-L732>.

It's not really clear exactly how this happens, but I have a suspicion that it's probably more common when we use params like `task_regex` to run backfill on only a subset of a dag, and in that scenario it's easier to imagine weird things happening. Incidentally, along with many other params, I am proposing initially to remove `task_regex`, and this is mentioned in the doc. My thought is that I don't really like having to deal with that complexity; it's confusing, not super well-defined behavior, and I suspect it is the likely cause of the deadlock concerns that can be found in the code. And I figured to just lead with that in the proposal and see if anyone has any objections.

> I think we should make sure that backfill runs are scheduled and
> queued in the executor in the same "scheduler loop" or get a better
> mechanism to avoid deadlocks.

Same thing here -- this is a different kind of deadlock. And on scheduling in the same scheduler loop, I agree; I did not propose otherwise. Indeed, part of the motivation is to reduce complexity and "two ways of doing things", and if I wrote a second scheduler *in the scheduler* for this, I would not be doing that! But there will still be *more* for the scheduler to do.
I think the main thing will be, essentially, creating the dag runs. There will need to be a different path for creating the dag runs, but once they are created, I expect the task instances would be managed the same as any other task instance. There are other comments about deadlocks, but again, it's a different thing from what I call out in the proposal, and I think I agree with you and we are on the same page -- the tasks in a backfill should be handled by the same mechanisms as normal tasks and therefore have the same *database* deadlock risks.

> But that also requires some
> mechanism to avoid starvation - for example we should only allow say
> max 30% of runs scheduled and queued within a single scheduler loop to
> be backfills.

I was trying to avoid taking on a larger question about scheduling priority. I think some ideas around that have been simmering for a while, but it sort of merits its own AIP. That said, one limit mechanism already available that I do not propose to remove is using a pool. Interestingly, right now backfill does not seem to respect the defined pool at all, though it does have an optional pool argument. I think we should keep pool, but apply it as an optional override -- so the defined pool will be respected unless it is overridden in backfill configuration. A user would then be able to run backfill tasks in a pool that limits their concurrency.

One problem with things like priority weight in Airflow is that they are not forward-looking; they are only evaluated in the current scheduler loop, with the tasks at hand. So you might schedule a bunch of low-priority things because that's all there is to schedule right now, but in 2 minutes the highest-priority thing comes up and now it can't be scheduled. This sort of complexity is why I was hoping to avoid folding it into this AIP, which I think has enough on its plate. I am open to adding some simple concurrency controls for backfill; I think it's a reasonable idea.
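To make the pool-as-override idea concrete, here is a minimal sketch of the resolution rule I have in mind. All names (`Backfill`, `pool_override`, `resolve_pool`) are hypothetical, not actual Airflow internals -- the point is only the precedence: the task's defined pool wins unless the backfill configuration explicitly overrides it.

```python
# Hypothetical sketch, not actual Airflow code: which pool should a
# backfill task instance run in?
from dataclasses import dataclass
from typing import Optional

@dataclass
class Backfill:
    # Optional pool set in the backfill configuration (hypothetical field).
    pool_override: Optional[str] = None

@dataclass
class Task:
    # Pool defined on the task itself.
    pool: str = "default_pool"

def resolve_pool(task: Task, backfill: Optional[Backfill]) -> str:
    """The task's defined pool is respected unless the backfill
    configuration explicitly overrides it."""
    if backfill is not None and backfill.pool_override:
        return backfill.pool_override
    return task.pool

print(resolve_pool(Task(pool="etl_pool"), None))                       # etl_pool
print(resolve_pool(Task(pool="etl_pool"), Backfill()))                 # etl_pool
print(resolve_pool(Task(pool="etl_pool"), Backfill("backfill_pool")))  # backfill_pool
```

With this shape, a user who wants to throttle a backfill just points it at a small pool; a user who does nothing gets the same pool behavior as a normal run.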
But I'm not sure exactly what would be the best thing, and I expect some thoughts on this will materialize over the course of the AIP's implementation. I do make small mention of this at the bottom of the doc, that it's under consideration. For now it is essentially (1) pool overrides and (2) pausing the backfill.

> I think it's crucial to design and describe how the "looping" process
> should look like for scheduler, whether we continue having
> mini-scheduler and how backfill scheduling processing should look like.

I think the mini-scheduler is sort of unrelated, because my expectation is that once tasks are created, they will be managed the same as other tasks. At a high level, my thought is that the "backfill part" of the scheduler will be in the dag run creation; when it comes to queuing and managing tasks, it would be handled by the normal process, which should remain more or less unchanged. I can add this to the AIP doc.
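To illustrate the loop shape I'm describing, here is a toy sketch (every name here is made up for illustration; none of this is actual Airflow scheduler code). The only backfill-specific step is materializing dag runs; once a run exists, its task instances flow through the same unchanged scheduling path as any other run.

```python
# Toy model of the proposed loop shape: backfill logic lives only in
# dag run creation; queuing treats all runs identically.
from dataclasses import dataclass, field

@dataclass
class DagRun:
    logical_date: str
    is_backfill: bool = False
    tasks_queued: bool = False

@dataclass
class SchedulerState:
    dag_runs: list = field(default_factory=list)
    # Backfill dates still awaiting dag run creation (hypothetical bookkeeping).
    backfill_dates: list = field(default_factory=list)

def scheduler_loop(state: SchedulerState) -> None:
    # Backfill-specific step: create dag runs for pending backfill dates.
    for date in state.backfill_dates:
        state.dag_runs.append(DagRun(logical_date=date, is_backfill=True))
    state.backfill_dates.clear()

    # Unchanged step: queue tasks for every run, backfill or not, the same way.
    for run in state.dag_runs:
        run.tasks_queued = True

state = SchedulerState(
    dag_runs=[DagRun("2024-06-01")],
    backfill_dates=["2024-01-01", "2024-01-02"],
)
scheduler_loop(state)
print(len(state.dag_runs))                              # 3
print(all(r.tasks_queued for r in state.dag_runs))      # True
```

The design point the sketch is meant to convey: no second scheduler, and no backfill-specific task handling -- just one extra source of dag runs feeding the existing loop.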