It could potentially be done, but I believe it would require quite a bit of
work on re-architecting the way on how the database is accessed by Airflow
and is not compatible with Airflow Architectes - we would probably not be
able to gain significant scaling improvements anyway.

For now either Scheduler or Workers can both read and write to the DB,
there are no pieces of Airflow that we can clearly separate as "read-only".
The pattern you describe Valeriy makes sense if you have some components of
the system that can rely on read-only access and then they could indeed
read from the replicas. But to my understanding how the database is used
in Airflow, this is not really the pattern we can have easily.
Airflow currently uses a lot of DB locks as the HA/scalability mechanisms
and the logic of DB writes is distributed among internal
scheduler components as well as workers - it means that a lot of queries/DB
access are read-write rather than read-only. In case of leader-> follower
replication all writes have to go to the leader so if you do not have
clearly separated "read-only" components, it brings no benefits at all.

Simply not all software architectures play well with such leader-> follower
patterns. I would love to understand Valeriy - do you have any observations
and statistics that led you to the conclusion that we could improve
horizontal scaling by using read replicas? Is it evidence based or more of
a "general" statement without prior looking at how Airflow uses the DB?
Do you have an example of queries/patterns of airflow usage of the DB that
could benefit from it? I'd love to hear your experiences with it?

I think even if there are such cases in Airflow where we have vast amounts
of reads coming from one place without accompanying writes, they might come
from mis-use of Airflow rather than from the internal "Db access
architecture". For example there is one usage pattern that potentially
generates a lot of reads, but this is actually an anti-pattern of
Airflow usage. If you are using VARIABLES in the DAG construction phase,
that can lead to a lot of reads generated for the variables table while
DAGs are continuously being parsed by Airflow Scheduler, but this is
definitely an Anti-Pattern
http://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#variables

One other place that "could" potentially benefit from using read-replicas
is the webserver. Webserver is "mostly read" by its nature. However
Airflow's webserver "traffic" to databse is super-small comparing to the
traffic generated by scheduler and workers. There are usually just a
handful of people using the Airflow webserver  and the frequency/amount of
queries generated by combined usage of all the users would be few orders of
magnitude less than the rest (I believe) so making it read-replica-aware
would bring very little improvements (if any)

J.


On Sun, Mar 14, 2021 at 11:58 AM Xiaodong Deng <[email protected]> wrote:

> This is definitely a good question. But to my understanding/opinion, this
> should be addressed at the database side, and Airflow does not need to know
> anything behind it or make any change (it simply accepts a DB connection
> URI as *sql_alchemy_conn* in the configuration).
>
> Some related infor/discussions I can find, for your information:
> - Postgres:
> https://www.postgresql.org/message-id/CANu8Fiy_TFdFAbCt-snjwFP%2BJJs%2B%2B08r1X5Aincf-rb1ikDW%2Bg%40mail.gmail.com
> - MariaDB MaxScale (NOTE: MariaDB is a fork of MySQL; MaxScale is
> MariaDB's proxy extension; For Airflow, MariaDB is NOT tested/recommended.
> Here I just take this as an example for your question):
> https://mariadb.com/kb/en/mariadb-maxscale-24-readwritesplit/#readwritesplit-routing-decisions
>
>
> Let me know if I missed or misunderstood anything.
>
>
> XD
>
> On Sun, Mar 14, 2021 at 10:41 AM Valeriy Solovyov <[email protected]>
> wrote:
>
>> Hello,
>>
>> It would be nice if we will be able to offload the read load from DB
>> Leader by horizontal scaling, which relies on adding more database
>> servers to a cluster to fulfill requests and splitting traffic between them.
>>
>> For example, we have the database Mysql Leader. Leader StandBy and 5
>> Followers.
>> And HAProxy exposes Readony access to databases on one port and ReadWrite
>> access on another.
>>
>>
>>

-- 
+48 660 796 129

Reply via email to