This has been long awaited :). And we definitely need to do it.

But regarding the scheduler pressure, I have a somewhat different
observation: the deadlock problem, which is a bit downplayed in the
current proposal, is IMHO the crucial problem to solve if we want to
make backfilling more accessible (from the UI and API) - because it
will no longer be an "exceptional" situation that requires "special
admin privileges". Backfilling will become a "regular" operation that
any user will be able to run. In many situations different users will
request backfills in parallel, and we should expect the deadlock
problems we currently experience to become much more serious.

I think we should make sure that backfill runs are scheduled and
queued in the executor in the same "scheduler loop", or that we get a
better mechanism to avoid deadlocks. Currently the HA scheduler has a
good mechanism of locking the DagRun with SKIP_LOCKED, and as long as
we make sure - and test well - that inside the scheduler loop the
whole DAG_RUN and related tables are actually locked for modifications
between schedulers, we should be fine. But as soon as we open it up to
additional operations acting in parallel on the same DAG_RUNs (and
related tables) and modifying them, we open the deadlock gateway -
widely.

The deadlocks we observe essentially come from several parallel
processes trying to access the same resources in bulk, in parallel, or
both, which makes them tricky to avoid. Just recently we worked around
a problem where the mini-scheduler for mapped tasks caused deadlocks,
by effectively SKIPPING the mini-scheduler when the lock cannot be
obtained immediately. But that is really a band-aid, and if we add
backfills that can be easily run and managed from the UI, at any time,
the problem will only escalate unless we put deadlock avoidance front
and foremost as a design goal.
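
Roughly, that band-aid looks like this (hand-written sketch, not the
actual code - the function name is made up):

    from sqlalchemy import select
    from sqlalchemy.exc import OperationalError
    from airflow.models.dagrun import DagRun

    def _try_mini_schedule(session, dag_id, run_id):
        try:
            dag_run = session.scalars(
                select(DagRun)
                .where(DagRun.dag_id == dag_id, DagRun.run_id == run_id)
                # NOWAIT: fail immediately instead of queueing for the lock
                .with_for_update(nowait=True)
            ).one_or_none()
        except OperationalError:
            # Another transaction holds the lock - skip the mini-scheduler
            # pass entirely rather than waiting and risking a deadlock.
            return
        if dag_run is not None:
            ...  # downstream scheduling work done under the lock

It avoids the deadlock, but only by giving up on the work - which is
exactly why I call it a band-aid.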

Also, with AIP-72 - where there is no direct database access from the
task - I believe the MINI-SCHEDULER is going to go away (Ash, I am not
sure this has been discussed in AIP-72, but I believe we should
explain what is going to happen with it?). That will nicely solve the
deadlocks it currently causes, but when we add backfills run via the
API, we should make sure we do not add it back. I **THINK** the right
solution is to include backfill processing in exactly the same
scheduling loop as all other events (which I think has not been
precisely specified in AIP-78). But that also requires some mechanism
to avoid starvation - for example, we could allow at most, say, 30% of
the runs scheduled and queued within a single scheduler loop to be
backfills. This is just a first idea - and likely wrong - that came to
my mind; we can probably come up with better ones. But I think it is
crucial to design and describe how the "looping" process should look
for the scheduler, whether we continue having the mini-scheduler, and
how backfill scheduling should work.
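
Just to make the starvation idea concrete (purely illustrative
pseudocode - the names and the 30% number are made up on the spot):

    def pick_runs_for_this_loop(regular_runs, backfill_runs, max_runs,
                                backfill_share=0.3):
        # Hard cap: backfill runs never take more than `backfill_share`
        # of the per-loop budget, so regular runs cannot be starved by
        # a large backfill kicked off from the UI or API.
        backfill_budget = int(max_runs * backfill_share)
        picked_backfills = backfill_runs[:backfill_budget]
        picked_regular = regular_runs[:max_runs - len(picked_backfills)]
        return picked_regular + picked_backfills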

Ideally - so we know what we are approving here - a high-level
diagram of the scheduler loop, the locking mechanism used, how that
mechanism works in an HA environment, and how it protects against
deadlocks is pretty crucial to understanding what we agree to in the
AIP.

J.



On Fri, Jul 12, 2024 at 12:42 PM Michał Modras
<michalmod...@google.com.invalid> wrote:
>
> I think it makes sense to orchestrate backfills in a more managed way than
> through a CLI command, especially since execution of tasks would happen
> through regular executors configured in the Airflow deployment. One
> concern I have, also called out in the AIP, is the increased load on the
> scheduler. I think having some reliability/separation measures so that
> execution of backfill does not impact/starve regular, day-to-day
> scheduler's job, will be critical.
>
> On Wed, Jul 10, 2024 at 9:06 PM Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
> > >
> > > Seems valid for default behaviour, but if I backfill for a year and
> > realize
> > > there was something wrong with the code, I don't want to manually fail
> > each
> > > dag run that is running. How about a force kill option?
> >
> >
> > Yes I would not expect users to have to go in and manually fail
> > each dag run if they wanted to cancel a backfill job.  TP was just talking
> > about pausing.  Here's how I phrased it in the doc:
> >
> >
> >    - We should be able to view backfill jobs in the webserver and observe
> >    progress and status, and cancel or pause them
> >
> >
> > There will be some details that will need to be sorted through but that's
> > the high level goal.
> >
