george-zubrienko commented on issue #30884: URL: https://github.com/apache/airflow/issues/30884#issuecomment-1525448460
> decrease the job_heartbeat_sec -> 60 seconds seems a bit excessive, scheduler should rarely run scheduler loop for longer than a few seconds, do you know why you have that long of a heartbeat expectations?

As I recall, we used to have a lot of issues with tasks (k8s pods) receiving SIGTERM in high-load periods because they didn't heartbeat fast enough (at least that was our theory). Then we figured out that this was somehow related to database communication.

First we tried playing around with pgbouncer settings, increasing the number of allowed client connections and the pool size. We also bumped the database from 2 cores / 8 GB to 4 cores / 16 GB and base IOPS from around 100 to >2000, which helped for a couple of months. Then people produced more DAGs and more tasks, and it seemed unreasonable that we would need to bump the database again, while pgbouncer was already running with a connection pool size of 1k.

Then we found out that increasing that setting (job_heartbeat_sec) significantly reduces the number of database sessions, and our problems with dying tasks were resolved. I should probably have opened an issue on that as well, but we were so happy our models could be trained again that we sort of let it slip somewhere in the backlog.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
