ephraimbuddy commented on issue #18011: URL: https://github.com/apache/airflow/issues/18011#issuecomment-922546341
> Hi @ephraimbuddy - I work with @WattsInABox. We don't see `FATAL: sorry, too many clients already.` but we do see:
>
> ```
> Traceback (most recent call last):
>   File "/opt/app-root/lib64/python3.8/site-packages/airflow/jobs/base_job.py", line 202, in heartbeat
>     session.merge(self)
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
>     return self._merge(
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
>     merged = self.query(mapper.class_).get(key[1])
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
>     return self._get_impl(ident, loading.load_on_pk_identity)
>
> ....
>
> psycopg2.OperationalError: could not connect to server: Connection timed out
> ```
>
> This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks will now retry since we have #16301, and they eventually succeed, but sometimes a task is SIGTERM'ed 5 times or more before succeeding, which is not ideal for tasks that take an hour plus. I suspect this also sometimes results in downstream tasks being set to `upstream_failed` even though every upstream task actually succeeded, but I can't prove it.
>
> We tried bumping `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to `60` to ease up on hitting the database, with no luck. The error also happens when only a couple of DAGs are running, so there is not much load on our nodes or the database, and we don't think it's a networking issue.
>
> Our SQLAlchemy pool size is 350, which might be high, but my understanding is that the pool does not create connections until they are needed, and according to AWS monitoring the maximum connection count we ever hit at peak is ~300-370, which should be totally manageable on our `db.m6g.4xlarge` instance. However, if it's a 350-connection pool for *each* worker, and each worker opens lots of connections that then stay alive in its pool, perhaps we are exhausting Postgres memory.
>
> Do you have any additional advice on things to try?

In 2.1.4 we added some limits to the number of queued dagruns the scheduler can create, and I suspect the database connection issue will go away with it. I was getting the `FATAL: sorry, too many clients already.` db error until the queued dagruns were limited in https://github.com/apache/airflow/pull/18065.
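For anyone reading along: the per-DAG knob that already exists for capping concurrent runs is `max_active_runs` on the `DAG` object. Whether the 2.1.4 queued-dagrun limit keys off that same value is an assumption on my part, so check the linked PR for the actual mechanism. A minimal sketch of the existing knob (DAG id and schedule are placeholders):

```python
# Minimal sketch: cap concurrent runs per DAG with max_active_runs.
# Whether the new queued-dagrun limit uses this value is an assumption;
# see https://github.com/apache/airflow/pull/18065 for the real change.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_capped_runs",      # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,                 # at most one run of this DAG at a time
) as dag:
    DummyOperator(task_id="noop")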
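```

Separately, on the pool-size reasoning in the quoted report: a rough worst-case estimate, assuming each Airflow component (scheduler, webserver, every worker) keeps its own SQLAlchemy pool. The pool size comes from the report; the component count and overflow value below are assumptions for illustration only.

```python
# Rough worst-case count of Postgres connections across Airflow components.
# pool_size is from the report above; the other numbers are assumptions.
pool_size = 350     # [core] sql_alchemy_pool_size, per component
max_overflow = 10   # [core] sql_alchemy_max_overflow (SQLAlchemy's default)
components = 6      # e.g. scheduler + webserver + 4 workers (hypothetical)

worst_case = components * (pool_size + max_overflow)
print(f"worst-case connections: {worst_case}")  # 6 * 360 = 2160
```

Even though connections are only opened on demand, once opened they stay checked into each component's pool, so the aggregate across many workers can end up far above what a single 350-connection pool suggests.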
