r-richmond commented on issue #56635:
URL: https://github.com/apache/airflow/issues/56635#issuecomment-3447976698
If it helps, we have mitigated this issue somewhat. The steps we took, in what we think is the rough order of importance:
1. Reduced the number of workers per api-server to 1
   * Opened #57273 to improve the docs here
   * When this setting was above 1 it was trivial to tip over the api-server pods. With it back at 1 things are no longer abysmal, but still not good
1. Significantly increased the number of api-server replicas (3x our original values)
   * We still see pods getting locked up and subsequently reaped due to liveness probe failures, but now we have enough pods that we generally don't get unlucky enough that they are all down at the same time
1. Deployed PgBouncer
1. Disabled SQLAlchemy pooling (see the first sketch after this list)
1. Reduced the UI's automatic refresh interval from 3 seconds to 60 seconds
1. Increased PgBouncer replicas to 2
1. Created a daily DAG to delete all DAG runs that are not among the 30 most recent for any given DAG (see the second sketch after this list)
   * This took us from ~17k DAG runs in the metadata DB down to ~3k
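
For context on the pooling change: once PgBouncer is handling pooling, keeping SQLAlchemy's client-side pool as well just stacks two pools on top of each other, so we turned the SQLAlchemy one off and let each process open plain connections. A minimal sketch of what that means at the SQLAlchemy level (the DSN is a placeholder, and in Airflow itself this is driven by configuration rather than code you write):

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Placeholder DSN; in a real deployment this points at PgBouncer,
# not directly at Postgres.
DB_URL = "postgresql+psycopg2://airflow:PASSWORD@pgbouncer:6432/airflow"

# NullPool disables client-side pooling: every checkout opens a fresh
# connection and every checkin closes it, leaving pooling to PgBouncer.
engine = create_engine(DB_URL, poolclass=NullPool)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```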
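
The retention cleanup is, roughly, the sketch below. We run it as a daily DAG, but the interesting part is the SQL, so the sketch is written as a standalone script that talks to the metadata DB directly with plain SQLAlchemy; note that in Airflow 3 task code does not get direct metadata DB access by default, so wherever this runs needs its own DB credentials. The `dag_run` table, the `dag_id`/`logical_date` columns, and the placeholder DSN are assumptions to verify against your own schema version:

```python
"""Keep only the N most recent DAG runs per DAG in the metadata DB.

A minimal sketch; run it with direct metadata DB access (e.g. from a
maintenance job), not from a Task SDK worker.
"""
from sqlalchemy import create_engine, text

# Placeholder DSN for the metadata database (ideally routed via PgBouncer).
METADATA_DB_URL = "postgresql+psycopg2://airflow:PASSWORD@pgbouncer:6432/airflow"
KEEP_PER_DAG = 30

# Rank runs per dag_id by logical_date (newest first) and delete everything
# past the cutoff. Table and column names are assumptions based on our schema.
CLEANUP_SQL = text(
    """
    DELETE FROM dag_run
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY dag_id
                       ORDER BY logical_date DESC
                   ) AS rn
            FROM dag_run
        ) ranked
        WHERE rn > :keep
    )
    """
)

if __name__ == "__main__":
    engine = create_engine(METADATA_DB_URL)
    with engine.begin() as conn:  # commits on success, rolls back on error
        result = conn.execute(CLEANUP_SQL, {"keep": KEEP_PER_DAG})
        print(f"Deleted {result.rowcount} old dag_run rows")
```

Worth double-checking that deleting `dag_run` rows cascades to the related `task_instance` rows in your schema version before pointing anything like this at production.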
Misc things we found along the way
1. As mentioned by @trau-sca, even light use of the frontend can lock up the api-server pods, causing them to fail the liveness probe, which leads to pod restarts and all running tasks getting marked as failed with a state mismatch error
2. The docs around api_worker recommendations were not strong enough.
Misc Thoughts
1. Having the api-server handle both the frontend & the backend/REST API has made AF3 more brittle than before: a bug in the frontend that locks up the api-server pods can now lead to outages that cause the workers to disconnect and fail their tasks.
2. It would be nice if there were a way to decouple the two. For now I think a way to dedicate some number of api-server replicas to the backend only (~workers/triggerer/processor) would suffice.