Hi Jarek,

I really appreciate the thorough information. I will dig into those references.

Thanks,

Ping

On Sat, Dec 18, 2021 at 12:50 PM Jarek Potiuk <[email protected]> wrote:

> I believe the scheduler's active/active horizontal scalability was one of the last "single points of failure" we addressed for scalability. For many years, the scheduler was the only part that was not possible to scale. We also had a number of reports from other customers that it became a bottleneck for them. There were at least two talks at the first Airflow Summit where our users described workarounds for their "scheduler scalability" problems. I also personally think (and I've seen it for a long time) that if your system's scalability depends on a single processor or DB connection, this will hit you sooner or later. So it matters to have a solution that you can scale.
>
> However, I think before you make any assumptions from your "current use", it would be great if you looked at the past discussions and resources, and saw both the context of the change and our quest of making Airflow something different than it was before - serving more cases than it did before and becoming a much more versatile scheduler that can handle a lot more than what you could do with 1.10 (which your experience is mostly about).
>
> There was a very extensive discussion and testing as part of AIP-15 when we discussed this (I think it started two years ago), and the results of the discussion and analysis are captured here:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103092651
>
> I'd say your observation is very specific to your case and the version of Airflow you use, but in a number of other cases this problem started to show up. Different users have different structures of DAGs/sizes where a single scheduler starts to show its limits. And to be honest - your case is by far not the "biggest" one that we saw. And most importantly - not the biggest we want to handle.
> Our "forward looking" is what really brought us as a community to addressing this in the first place.
>
> To be perfectly honest - staying with what Airflow could do a year or two ago is not exciting at all. Airflow 2 is all about the future, as much as it embraces the past. We are aiming for a MUCH BIGGER scale than you can do with the single scheduler, bigger even than what you explained. The future of Airflow goes FAR beyond the current use cases. Limiting Airflow to what it could do a year ago is not our goal at all. We really want to make Airflow a much more generic scheduler that handles way more cases - thousands of scheduled tasks per second - possibly even handling streaming flows in the future and being able to react to changes in fractions of seconds. For that, scalability is a must, and Ash and the Astronomer team did some very extensive testing around the scalability approach we've chosen. We did an extensive review of the concept and then of the code, and we performed a very detailed walk-through of the code, where most active committers took a very, very deep look into how it was done. We had a lot of comments, fixes and improvements (and also a number of fixes afterward to make it robust, scalable and future-looking). Finally, I also encourage you to take a look at the fantastic talk that Ash gave at the Airflow Summit describing the decisions behind the new scheduler architecture: https://www.youtube.com/watch?v=DYC4-xElccE. That can give you more context on what was implemented there and why.
>
> You can read more about it in this article: https://www.astronomer.io/blog/airflow-2-scheduler - including a short write-up on the use cases that might benefit from the scalability of the scheduler.
>
> So in short - yes, we think (I believe I speak for all the community members who discussed, agreed to and took part in the Airflow 2 effort) that the active-active scheduler is a must - if not for the current scale and cases (where we think it is already useful) - then for all the future cases that we want Airflow to excel at.
>
> I think soon you will start seeing many more cases coming from those investments.
>
> J.
>
> On Fri, Dec 17, 2021 at 7:47 AM Ping Zhang <[email protected]> wrote:
> >
> > Hi Airflow community,
> >
> > I would like to share some of my thoughts on the active-active scheduler HA mode.
> >
> > I am wondering whether the active-active scheduler mode is really needed to improve the scheduler performance.
> >
> > One scheduler host can easily support ~5000 dags in our production with a max scheduling delay of only ~60 seconds (for the largest dag, ~23K tasks) after our Next-Gen Scheduler work.
> >
> > I don't see a need to set up the active-active scheduler for performance reasons.
> >
> > Setting up the active-active scheduler mode can only increase the complexity of cluster operations. There are also restrictions on the DB, including DB types and DB versions.
> >
> > I do agree that the airflow scheduler needs better HA. We could use the active-passive mode. This can greatly simplify the scheduler code, without needing the lock in the code and dealing with potential deadlock.
> >
> > We noticed that the majority of our prod incidents come from the database. The current active-active HA mode might exacerbate the problem.
> >
> > Would love to hear your thoughts about this.
> >
> > Best wishes
> >
> > Ping Zhang
