r-richmond commented on issue #56635:
URL: https://github.com/apache/airflow/issues/56635#issuecomment-3447976698

   If it helps, we have mitigated this issue somewhat. Here are the steps we 
took, in what we think is the rough order of importance:
   
   1. Reduced the number of workers per api-server to 1
       * Opened #57273 to improve the docs here
       * When this setting was not 1, it was trivial to tip over api-server 
pods. With it back at 1, it is no longer abysmal, but it is still not good.
   1. Significantly increased the number of api-server replicas (3x our 
original values)
       * We still see pods getting locked up and subsequently reaped due to 
liveness probe failures, but we now have enough pods that they are generally 
not all down at the same time.
   1. Implemented PgBouncer
   1. Disabled SQLAlchemy pooling
   1. Reduced the automatic refresh frequency from every 3 seconds to every 60 seconds
   1. Increased PgBouncer replicas to 2
   1. Created a daily DAG that deletes all DAG runs outside the 30 most 
recent for each DAG.
       * This took the metadata DB from ~17k DAG runs to ~3k.
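
The retention rule in the last step (keep the 30 most recent runs per DAG, delete the rest) can be sketched in plain Python. This is only an illustration of the selection logic, not our actual cleanup DAG: `RunRecord` and `select_runs_to_delete` are hypothetical names, and the sketch assumes run rows have already been fetched from the metadata DB by some other means.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class RunRecord(NamedTuple):
    """Hypothetical stand-in for one DAG run row (not an Airflow class)."""
    dag_id: str
    run_id: str
    logical_date: datetime


def select_runs_to_delete(runs, keep=30):
    """Return the runs that fall outside the `keep` most recent per dag_id."""
    by_dag = defaultdict(list)
    for run in runs:
        by_dag[run.dag_id].append(run)

    to_delete = []
    for dag_runs in by_dag.values():
        # Sort newest first, so everything past index `keep` is stale.
        dag_runs.sort(key=lambda r: r.logical_date, reverse=True)
        to_delete.extend(dag_runs[keep:])
    return to_delete


# Usage: 35 daily runs of one DAG, so the 5 oldest are selected.
start = datetime(2025, 1, 1)
runs = [
    RunRecord("example_dag", f"run_{i}", start + timedelta(days=i))
    for i in range(35)
]
stale = select_runs_to_delete(runs, keep=30)
print(sorted(r.run_id for r in stale))
```

A count-based rule like this keeps the table size proportional to the number of DAGs rather than to how often each DAG runs, which is why it shrinks the run count so sharply for frequently scheduled DAGs.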
   
   Misc things we found along the way
   
   1. As mentioned by @trau-sca, even light use of the frontend can lock up 
the api-server pods. They then fail the liveness probe, which leads to pod 
restarts and to all running tasks being marked as failed with a state mismatch 
error.
   2. The docs around api_worker recommendations were not strong enough.
   
   Misc Thoughts
   
   1. Having the api-server handle both the frontend and the backend/REST API 
has made AF3 more brittle than before. A bug in the frontend that locks up the 
api-server pods can now lead to outages that cause the workers to disconnect or 
fail their tasks.
   2. It would be nice if there was a way to decouple the two. For now, I 
think a way to dedicate some number of api-server replicas to the backend 
(workers/triggerer/processor) only would suffice.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
