This has been long awaited :). And we definitely need to do it. But regarding the scheduler pressure, I have a somewhat different observation: the deadlock problem, which is a bit downplayed in the current proposal, is - IMHO - the crucial problem to solve if we want to make backfilling more accessible (from the UI and API). Backfilling will no longer be an "exceptional" situation that requires "special admin privileges"; it will become a "regular" operation that any user can run. In a number of situations different users will request backfills in parallel, and we should expect the deadlock problems we currently experience to become much more serious.
I think we should make sure that backfill runs are scheduled and queued in the executor in the same "scheduler loop", or get a better mechanism to avoid deadlocks. Currently the HA scheduler has a good mechanism of locking the DagRun with SKIP_LOCKED, and as long as we make sure - and test well - that inside the scheduler loop the whole DAG_RUN and the related tables are actually locked for modifications between schedulers, we should be fine. But as soon as we open it up to additional operations acting in parallel on the same DAG_RUNs (and related tables) and modifying them, this opens the deadlock gateway wide. The deadlocks we observe essentially come from several parallel processes trying to access the same resources in bulk, in parallel, or both, and avoiding them gets complicated quickly.

Even recently we worked around a problem where the mini-scheduler for mapped tasks caused deadlocks. The workaround effectively SKIPS the mini-scheduler when the lock cannot be obtained immediately - but this is really a band-aid, and if we add backfills that can be easily run and managed from the UI at any time, the problem will only escalate unless we make avoiding deadlocks a first and foremost design goal. Also, with AIP-72 - where there is no direct database access from the task - I believe the MINI-SCHEDULER is going to go away (Ash, I am not sure this has been discussed in AIP-72, but I believe we should explain what is going to happen with it?). That will nicely solve the deadlocks it currently causes, but when we add backfills run via the API, we should make sure we do not add the problem back.

I **THINK** the right solution is to include backfill processing in exactly the same scheduling loop as all other events (which I think has not been exactly specified in AIP-78). But that also requires some mechanism to avoid starvation - for example, we could allow at most, say, 30% of the runs scheduled and queued within a single scheduler loop to be backfills. This is just a first idea - and likely wrong - that came to my mind; we can probably come up with better ones. But I think it is crucial to design and describe what the "looping" process should look like for the scheduler, whether we continue to have the mini-scheduler, and how backfill scheduling should work. Ideally - to know what we are approving here - a high-level diagram of the scheduler loop, the locking mechanism used, how it works in an HA environment, and how it protects against deadlocks would be pretty crucial to understanding what we agree to in this AIP.
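To make the starvation-cap idea a bit more concrete, here is a very rough Python sketch. This is NOT the actual scheduler code - the function name, the MAX_BACKFILL_FRACTION and BACKFILL_RUN_TYPE constants, and the exact query shape are purely illustrative assumptions on my side. The only point it tries to show is backfill runs being picked up inside the same SKIP_LOCKED critical section as regular runs, with a per-loop budget so they cannot starve them:

from sqlalchemy import select
from airflow.models import DagRun

# Illustrative knobs - not real configuration options.
MAX_BACKFILL_FRACTION = 0.3     # at most ~30% of a loop's budget for backfill runs
BACKFILL_RUN_TYPE = "backfill"  # placeholder; the real value lives in DagRunType


def runs_to_examine(session, max_runs):
    # Grab candidate dag runs with SELECT ... FOR UPDATE SKIP LOCKED, so a
    # second scheduler (or an API-triggered backfill manager) working on the
    # same rows simply skips them instead of blocking or deadlocking.
    candidates = session.scalars(
        select(DagRun)
        .where(DagRun.state == "queued")
        .order_by(DagRun.last_scheduling_decision)
        .limit(max_runs * 2)
        .with_for_update(skip_locked=True)
    ).all()

    backfill_budget = int(max_runs * MAX_BACKFILL_FRACTION)
    picked, backfills_picked = [], 0
    for run in candidates:
        if len(picked) >= max_runs:
            break
        if run.run_type == BACKFILL_RUN_TYPE:
            if backfills_picked >= backfill_budget:
                # Budget for this loop exhausted - leave the backfill run for a
                # later loop so regular runs are not starved.
                continue
            backfills_picked += 1
        picked.append(run)
    return picked

The exact percentage and query do not matter much; what matters is that backfill runs and regular runs compete under the same row locks in the same loop, rather than a separate process modifying the same tables in parallel.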
J.

On Fri, Jul 12, 2024 at 12:42 PM Michał Modras
<michalmod...@google.com.invalid> wrote:

> I think it makes sense to orchestrate backfills in a more managed way than
> through a CLI command, especially as execution of tasks would happen
> through regular executors configured in the Airflow deployment. One
> concern I have, also called out in the AIP, is the increased load on the
> scheduler. I think having some reliability/separation measures so that
> execution of backfill does not impact/starve regular, day-to-day
> scheduler's job will be critical.
>
> On Wed, Jul 10, 2024 at 9:06 PM Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
> > > Seems valid for default behaviour, but if I backfill for a year and
> > > realize there was something wrong with the code, I don't want to
> > > manually fail each dag run that is running. How about a force kill
> > > option?
> >
> > Yes, I would not expect users to have to go in and manually fail each
> > dag run if they wanted to cancel a backfill job. TP was just talking
> > about pausing. Here's how I phrased it in the doc:
> >
> > - We should be able to view backfill jobs in the webserver and observe
> >   progress and status, and cancel or pause them
> >
> > There will be some details that will need to be sorted through, but
> > that's the high-level goal.