george-zubrienko commented on issue #30884:
URL: https://github.com/apache/airflow/issues/30884#issuecomment-1525448460

   > decrease the job_heartbeat_sec -> 60 seconds seems a bit excessive, 
scheduler should rarely run the scheduler loop for longer than a few seconds, do 
you know why you have such a long heartbeat expectation?
   
   As I recall, we used to have a lot of issues with tasks (k8s pods) receiving 
SIGTERM in high-load periods because they didn't heartbeat fast enough (at 
least that was our theory). Then we figured out that this was somehow related to 
database communication. First we tried playing around with pgbouncer settings, 
increasing the number of allowed client connections and the pool size, and bumped 
the database from 2 cores / 8 GB to 4 cores / 16 GB and base IOPS from about 100 
to >2000, which helped for a couple of months. Then people produced more DAGs and 
more tasks, and it seemed unreasonable to us that we would need to bump the 
database again, while pgbouncer was already running with a connection pool size of 1k.
   
   Then we found out that increasing that setting (job_heartbeat_sec) significantly 
reduces the number of database sessions, and our problems with dying tasks were 
resolved. I should probably have opened an issue on that as well, but we were so 
happy our models could be trained again that we sort of let it slip somewhere in 
the backlog.
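
   As a rough illustration of why a longer heartbeat interval would cut database 
traffic, here is a back-of-the-envelope sketch. It assumes each running task 
writes one heartbeat to the database per `job_heartbeat_sec`. The function name 
and the task count of 600 are hypothetical, and 5 seconds is used as the 
baseline because, as far as I know, that is the Airflow default.

```python
def heartbeat_writes_per_second(running_tasks: int, job_heartbeat_sec: float) -> float:
    """Estimate heartbeat writes per second hitting the metadata database.

    Assumption (not taken from the comment above): every running task
    issues exactly one heartbeat write per heartbeat interval.
    """
    if job_heartbeat_sec <= 0:
        raise ValueError("job_heartbeat_sec must be positive")
    return running_tasks / job_heartbeat_sec


if __name__ == "__main__":
    running_tasks = 600  # hypothetical number of concurrently running tasks
    for interval in (5, 60):
        rate = heartbeat_writes_per_second(running_tasks, interval)
        print(f"job_heartbeat_sec={interval}: ~{rate:.0f} heartbeat writes/s")
```

   Under these assumptions, going from 5 to 60 seconds lowers the heartbeat write 
rate by a factor of 12, which would be consistent with the drop in database 
sessions described above, though it is only an estimate and not a measurement.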


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
