Hey Ping,

I don't think there is nearly enough information in what you described on
what "pluggable" scheduler means. What I see in the doc and your
description is a problem statement, but not a solution.

But please don't jump into trying to describe it just yet.

I am very skeptical about the idea in general. If you remove the scheduler
from Airflow Core, there is not much left. Practically speaking, Airflow
core *is* the scheduler when it comes to the internals. I almost think
that if someone wants to make the scheduler "pluggable", that calls for
forking Airflow - and forking Airflow (if someone really wants to do it)
and developing it for themselves will be far more effective than trying to
get a "pluggable" architecture. Also because this is at most a tactical
solution IMHO.

I believe a lot of this problem statement of yours is "past looking" rather
than "forward looking", and it does not account for the fact that not only
does Airflow change, but the environment around it changes too. By the time
the design and development of any such pluggable solution gets even close
to completion, the environment will have changed already and we will be in
a different place - both in terms of what Airflow will be capable of and
what the external environment will be.

MySQL 5.7 EOL is in a year - October 2023. We will surely drop it then, and
anyone using it should too. At that point the single-scheduler-only case
will be gone (we will not support 5.7 anyway). I seriously doubt we can
develop "another" scheduler within a year, and even if we do, if the only
reason for it is the single-scheduler case, it would be the first thing to
drop in October 2023. And if we do have a pluggable scheduler with 2
implementations by October 2023, I will be the first one to raise "let's
drop this non-locking scheduler as we don't need it any more". Look back
and imagine it is January 2019 and we decided to keep compatibility with
Python 2.7 - we would not be where we are now. We need to look into the
modern, new Airflow future rather than at some bad and discouraged ways of
using Airflow. Even more, we should encourage and help the users who are
using Airflow in those "non-future-proof" ways to switch to the new ways,
and add features that make the current scheduler more appealing for the
cases you mentioned.

Also, Airflow 2.4 brings Datasets and data-driven scheduling. And as
surprising as it might look, it will largely solve the "big DAG/small DAG"
problem. Simply speaking, Airflow DAGs will suddenly start becoming more
modular, and you will be able to do the same thing you did with a huge
1000-task DAG using 50 20-task DAGs connected via datasets. This will be a
far better, more modular solution. And rather than complicating Airflow by
designing and implementing multiple schedulers, I would rather focus on
developing tooling that will make distributed DAG development far more
appealing for all users. And those users (like Airbnb, with huge DAGs)
should follow suit in changing their approach - this will give them far
more capabilities, will enable them to distribute DAG development, and
will let them manage it way better than one huge, monolithic DAG.
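To make the dataset-based splitting concrete, here is a minimal sketch using the Airflow 2.4 Dataset API - note that the DAG names and the dataset URI are invented for the example, and this is an illustration of the pattern, not a prescribed implementation:

```python
# Hedged sketch only: splitting one big pipeline into small DAGs
# connected via the Airflow 2.4 Dataset API. All names/URIs are made up.
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://example-bucket/orders")  # hypothetical dataset URI

@dag(schedule="@daily", start_date=pendulum.datetime(2022, 8, 1), catchup=False)
def producer():
    @task(outlets=[orders])  # marks the dataset as updated on success
    def extract_orders():
        ...

    extract_orders()

# Instead of being a giant downstream section of the same DAG, the
# consumer is its own small DAG, triggered whenever "orders" is updated.
@dag(schedule=[orders], start_date=pendulum.datetime(2022, 8, 1), catchup=False)
def consumer():
    @task
    def transform_orders():
        ...

    transform_orders()

producer()
consumer()
```

Repeating this pattern across the natural "seams" of a 1000-task DAG is what turns it into many small, independently scheduled DAGs.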

Maybe instead of adding pluggable schedulers, we should rather (after 2.4)
work on tooling that will help users with huge DAGs to split them. Maybe we
should add a way to prioritise DagRuns? Both of those are much more
forward-looking than trying to "cement" existing (bad) usage patterns IMHO
by making them "blessed" with a 2nd type of scheduler supporting cases
that should be solved differently.
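To sketch what DagRun prioritisation could mean, here is a minimal plain-Python illustration (this is not actual Airflow code; `QueuedDagRun` and the priority field are invented for the example) of picking queued runs by a per-DAG priority rather than a single global order:

```python
# Hypothetical sketch: a priority-aware pick order for queued DagRuns,
# so a small critical DAG is not starved by a huge backfilling DAG.
from dataclasses import dataclass, field
import heapq
import itertools

_counter = itertools.count()  # tie-breaker preserving queue order

@dataclass(order=True)
class QueuedDagRun:
    sort_key: tuple = field(init=False)
    dag_id: str = field(compare=False)
    priority: int = field(compare=False)  # lower value = scheduled first

    def __post_init__(self):
        self.sort_key = (self.priority, next(_counter))

def schedule_order(runs):
    """Return dag_ids in the order a priority-aware scheduler would pick them."""
    heap = list(runs)
    heapq.heapify(heap)
    return [heapq.heappop(heap).dag_id for _ in range(len(heap))]

# The small critical DAG outranks the huge backfill even though the
# backfill's runs were queued first.
runs = [QueuedDagRun("huge_backfill", priority=10),
        QueuedDagRun("huge_backfill", priority=10),
        QueuedDagRun("small_critical", priority=1)]
print(schedule_order(runs))  # ['small_critical', 'huge_backfill', 'huge_backfill']
```

The point of the sketch is only that a priority knob composes with the existing queue, rather than requiring a second scheduler implementation.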

That's how I see it.

J.


On Tue, Aug 23, 2022 at 7:46 AM Ping Zhang <[email protected]> wrote:

> Hi Airflow community,
>
> We are proposing to have the Airflow Scheduler adopt a pluggable pattern,
> similar to the executor.
>
> Background:
>
> Airflow 2.0 has introduced a new scheduler in AIP-15 (Scheduler HA +
> performance improvement)
> <https://airflow.apache.org/blog/airflow-two-point-oh-is-here/#massive-scheduler-performance-improvements>.
> The new scheduler leverages the skip-locked feature in the database to
> scale horizontally
> <https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html#overview>.
> It works well for relatively small clusters (small number of tasks in a dag
> and small number of dag files) as shown in the benchmark results from the
> community:
>
> Scenario (1000 tasks in total)                    | DAG shape   | 1.10.10 Total Task Lag | 2.0 beta Total Task Lag | Speedup
> 100 DAG files, 1 DAG per file, 10 Tasks per DAG   | Linear      | 200 seconds            | 11.6 seconds            | 17 times
> 10 DAG files, 1 DAG per file, 100 Tasks per DAG   | Linear      | 144 seconds            | 14.3 seconds            | 10 times
> 10 DAG files, 10 DAGs per file, 10 Tasks per DAG  | Binary Tree | 200 seconds            | 12 seconds              | 16 times
> From: https://www.astronomer.io/blog/airflow-2-scheduler
>
> From the most recent 2022 Airflow survey
> <https://docs.google.com/document/d/18E3gBbrPI6cHAKRkRIPfju9pOk4EJNd2M-1fRJO4glA/edit#heading=h.yhlzd4j2mpzz>,
> 81% of the Airflow users have between 1 to 250 DAGs in their largest
> Airflow instance (4.8% of users have more than 1000 DAGs). 75% of the
> surveyed Airflow users have between 1 to 100 tasks per DAG. The Airflow 2.0
> scheduler can satisfy these needs.
>
> However, there are cases where the Airflow 2.0 scheduler cannot be
> deployed due to:
>
>    1. The team cannot use more than one scheduler because the company's
>    database team does not support MySQL 8+ or PostgreSQL 10+. (Arguably,
>    it is true that they should be supported, but in reality it can take
>    quite a while for large companies to upgrade to newer DB versions.)
>
>    2. Airflow 2.0 treats all DagRuns with the same scheduling priority
>    (see code
>    <https://github.com/apache/airflow/blob/6b7a343b25b06ab592f19b7e70843dda2d7e0fdb/airflow/jobs/scheduler_job.py#L923>).
>    This means DAGs with more DagRuns could be scheduled more often than
>    others, and large DAGs might slow down the scheduling of small DAGs.
>    This may not be desired in some cases.
>
>    3. For very large scale clusters (with more than 10 million rows in
>    the task instance table), the database tends to be the unstable
>    component. The infra team does not want to add extra load to the
>    database with more than one scheduler. However, a single Airflow 2.0
>    scheduler cannot support large scale clusters, as it has removed the
>    multi-processing of dag runs and only uses one core to schedule all
>    dag runs
>    <https://github.com/apache/airflow/blob/6b7a343b25b06ab592f19b7e70843dda2d7e0fdb/airflow/jobs/scheduler_job.py#L886-L976>.
>
> The above limitations hinder evolving Airflow as a general purpose
> scheduling platform.
>
> To address the above limitations and avoid making the scheduler core code
> larger and logic more complex, we propose to have a pluggable scheduler
> pattern. With that, the Airflow infra team/users can choose the best
> scheduler to satisfy their needs and even swap parts that need
> customization to achieve their best interest.
>
> Please let me know your thoughts about this and look forward to feedback.
>
> (Here is the google doc link,
> https://docs.google.com/document/d/1njmX3D_9a4TjjG9CYPWJqdkb9EyXkeQPnycYaMTUQ_s/edit?usp=sharing
> feel free to comment it in the doc)
>
> Thanks,
>
> Ping
>
>
