It is a complication, but it seems we can't do any better and remain 
scalable. In the end we want priorities enforced (maybe not in the way they're 
implemented today, but that's a separate discussion), and we don't know in 
advance how many tasks we'll have to iterate over, so fetching them into 
Python is a death sentence in some situations (not joking, I tried that with 
fetchmany and chunked streaming, and it was way too slow).
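
For context, the chunked-streaming attempt looked roughly like this (a 
simplified sketch, not the real scheduler code; table and column names are 
illustrative):

def iter_schedulable_tis(conn, batch_size=1000):
    # Stream candidate TIs into Python in batches via DB-API fetchmany(),
    # keeping priority ordering on the DB side. Every row still has to
    # cross the wire and be materialized in Python, which is what made it
    # too slow for large candidate sets.
    cur = conn.cursor()
    cur.execute(
        "SELECT dag_id, task_id, run_id, priority_weight "
        "FROM task_instance WHERE state = 'scheduled' "
        "ORDER BY priority_weight DESC"
    )
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row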

I actually thought of another optimization here:
Instead of fetching the entire TI relation, we could ignore mapped tasks and 
fetch only the individual tasks (operators), expanding them on the fly into 
the maximum number of TIs that could be created. But this approach isn't 
scalable either: an enormous backfill of a DAG with just 10 tasks would still 
make it fetch MBs of data on every scheduling loop. It's very slow and loads 
the DB server with heavy network requests.
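
Roughly, what I mean by the on-the-fly expansion (a hypothetical sketch, 
names are illustrative, not real Airflow schema): the candidate set is 
tasks x pending runs, so a large backfill of even a 10-task DAG still 
produces a huge amount of data per loop.

def candidate_tis(tasks, pending_run_ids):
    # tasks: small relation of (dag_id, task_id, priority_weight) fetched once
    # pending_run_ids: run_ids of dag runs that still need scheduling
    # For a backfill, len(pending_run_ids) can be enormous, so the product
    # below explodes even when the DAG itself has only a handful of tasks.
    for dag_id, task_id, weight in tasks:
        for run_id in pending_run_ids:
            yield (dag_id, task_id, run_id, weight)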

Well, it's not just about throughput; it's also about starvation of tasks 
that sometimes can't run for hours, and unfortunately we encounter this in 
production very often.

On Wednesday, September 24th, 2025 at 3:10 AM, Matthew Phillips 
<[email protected]> wrote:

> Hi,
> This seems like a significant level of technical complication/debt relative
> to even a 1.5x/2x gain (which as noted is only in certain workloads).
> Given airflow scheduling code in general is not something one could
> describe as simple, introducing large amounts of static code that lives in
> stored procs seems unwise. If at all possible, making this
> interface pluggable and provided via a provider would be the saner approach,
> in my opinion.
> 
> On Tue, Sep 23, 2025 at 11:16 AM asquator [email protected] wrote:
> 
> > Hello,
> > 
> > A new approach utilizing stored SQL functions is proposed here as a
> > solution to unnecessary processing/starvation:
> > 
> > https://github.com/apache/airflow/pull/55537
> > 
> > Benchmarks show an actual improvement in queueing throughput, around
> > 1.5x-2x for the workloads tested.
> > 
> > Regarding DB server load, I haven't been able to notice any difference so
> > far; we probably have to run heavier jobs to test that. Looks like a smooth
> > line to me.
> > 

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
