Hi Jarek,

I really appreciate the thorough information. I will dive deep into those
references.

Thanks

Ping


On Sat, Dec 18, 2021 at 12:50 PM Jarek Potiuk <[email protected]> wrote:

> I believe the scheduler's active/active horizontal scalability was one
> of the last "single points of failure" we addressed for scalability.
> For many years, the scheduler was the only component that was not
> possible to scale. We also had a number of reports from other
> customers that it became a bottleneck for them. There were at least
> two talks at the first Airflow Summit where our users described the
> workarounds they made for their "scheduler scalability" problems. I
> also personally think (and I've seen it for a long time) that if your
> system's scalability depends on a single processor or DB connection,
> this will hit you sooner or later. So it is important to have a
> solution that you can scale.
>
> However, I think before you make any assumptions from your "current
> use", it would be great if you looked at the past discussions and
> resources, and saw both the context of the change and our quest to
> make Airflow something different than it was before - serving more
> cases than it did before and becoming a much more versatile scheduler
> that can handle a lot more than what you could do with 1.10 (which
> your experience is mostly about).
>
> There was very extensive discussion and testing as part of AIP-15
> when we discussed this (I think it started two years ago), and the
> results of the discussion and analysis are captured here:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103092651
> I'd say what you observe is very specific to your case and the
> version of Airflow you use, but in a number of other cases this
> problem has started to show up. Different users have different DAG
> structures and sizes at which a single scheduler starts to show its
> limits. And to be honest - your case is far from the "biggest" one
> that we have seen. And most importantly - it is not the biggest we
> want to handle. Our "forward-looking" view is what really brought us
> as a community to address this in the first place.
>
> To be perfectly honest - staying with what Airflow could do a year or
> two ago is not exciting at all. Airflow 2 is all about the future, as
> much as it embraces the past. We are aiming for a MUCH BIGGER scale
> than you can reach with a single scheduler, bigger even than what you
> described. The future of Airflow goes FAR beyond the current use
> cases. Limiting Airflow to what it could do a year ago is not our goal
> at all. We really want to make Airflow a much more generic scheduler
> that handles way more cases - thousands of scheduled tasks per second,
> possibly even handling streaming flows in the future and being able to
> react to changes in fractions of a second. For that, scalability is a
> must, and Ash and the Astronomer team did some very extensive testing
> of the scalability approach we've chosen. We did an extensive review
> of the concept and then of the code, and we performed a very detailed
> walkthrough of the code, where most active committers took a very,
> very deep look into how it was done. We had a lot of comments, fixes
> and improvements (and also a number of fixes afterward to make it
> robust, scalable and future-looking). Finally, I also encourage you to
> take a look at the fantastic talk that Ash gave at the Airflow Summit
> describing the decisions behind the new scheduler architecture:
> https://www.youtube.com/watch?v=DYC4-xElccE. That can give you more
> context on what was implemented there and why.
>
> You can read more about it in this article:
> https://www.astronomer.io/blog/airflow-2-scheduler - including a short
> write-up on the use cases that might benefit from the scalability of
> the scheduler.
>
> So in short - yes, we think (I believe I speak on behalf of all the
> community members who discussed, agreed to and took part in the
> Airflow 2 effort) that an active-active scheduler is a must - if not
> for the current scale and cases (where we think it is already useful),
> then for all the future cases that we want Airflow to excel at.
>
> I think soon you will start seeing many more cases coming from those
> investments.
>
> J.
>
> On Fri, Dec 17, 2021 at 7:47 AM Ping Zhang <[email protected]> wrote:
> >
> > Hi Airflow community,
> >
> > I would like to share some of my thoughts on the active-active
> > scheduler HA mode.
> >
> > I am wondering whether the active-active scheduler mode is really
> > needed to improve the scheduler performance.
> >
> > One scheduler host can easily support ~5000 DAGs in our production
> > with a max scheduling delay of only ~60 seconds (for the largest DAG,
> > ~23K tasks) after our Next-Gen Scheduler work.
> >
> > I don't see a need to set up the active-active scheduler for
> > performance reasons.
> >
> >
> > Setting up the active-active scheduler mode can only increase the
> > complexity of cluster operations. There are also restrictions on DB,
> > including DB types and DB versions.
> >
> > I do agree that the Airflow scheduler needs better HA. We could use
> > the active-passive mode. This can greatly simplify the scheduler code,
> > without needing locks in the code or dealing with potential deadlocks.
> >
> > We noticed that the majority of our prod incidents come from the
> > database. The current active-active HA mode might exacerbate the
> > problem.
> >
> > Would love to hear your thoughts about this.
> >
> >
> > Best wishes
> >
> > Ping Zhang
>
