taylorfinnell commented on issue #18011:
URL: https://github.com/apache/airflow/issues/18011#issuecomment-922384877


   Hi @ephraimbuddy - I work with @WattsInABox. We don't see
`FATAL: sorry, too many clients already.`, but we do see:
   
   ```
   Traceback (most recent call last):
     File "/opt/app-root/lib64/python3.8/site-packages/airflow/jobs/base_job.py", line 202, in heartbeat
       session.merge(self)
     File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
       return self._merge(
     File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
       merged = self.query(mapper.class_).get(key[1])
     File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
       return self._get_impl(ident, loading.load_on_pk_identity)
   
   ....
   
   psycopg2.OperationalError: could not connect to server: Connection timed out
   ```
   
   This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks
now retry since we have #16301, and they eventually succeed. Sometimes a task
is SIGTERM'ed 5 times or more before it succeeds, which is not ideal for tasks
that take over an hour. I also suspect that at times this results in downstream
tasks being set to `upstream_failed` when in fact all upstream tasks succeeded,
but I can't prove it.
   
   We tried bumping `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to `60` to ease the
load on the database, with no luck. This error also happens when only a couple
of DAGs are running, so there is not much load on our nodes or the database.
We don't think it's a networking issue.
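
   For anyone checking our setup: that variable follows Airflow's
`AIRFLOW__{SECTION}__{KEY}` naming convention, so it should override
`job_heartbeat_sec` in the `[scheduler]` section. A minimal sketch of that
mapping is below; `parse_airflow_env_var` is a hypothetical helper written for
illustration, not an Airflow API.

```python
def parse_airflow_env_var(name):
    """Split an AIRFLOW__{SECTION}__{KEY} variable name into (section, key).

    Hypothetical helper for illustration only; it mirrors the naming
    convention and does not call into Airflow.
    """
    prefix = "AIRFLOW__"
    if not name.startswith(prefix):
        raise ValueError(f"not an Airflow config variable: {name!r}")
    # The section and key are separated by a double underscore.
    section, sep, key = name[len(prefix):].partition("__")
    if not sep or not section or not key:
        raise ValueError(f"expected AIRFLOW__SECTION__KEY, got {name!r}")
    # Config sections and keys are lowercase in airflow.cfg.
    return section.lower(), key.lower()


section, key = parse_airflow_env_var("AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC")
print(f"[{section}] {key} = 60")
```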
   
   Our sqlalchemy pool size is 350. This might be high, but my understanding
is that the pool does not create connections until they are needed, and
according to AWS monitoring the most connections we ever hit at peak time is
~300-370, which should be totally manageable on our `db.m6g.4xlarge` instance.
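
   As a rough sanity check on that pool size, here is the worst-case
arithmetic. This is a sketch under assumptions: a SQLAlchemy pool belongs to
one engine in one process, so each process that opens sessions could hold up
to `pool_size + max_overflow` connections. The `max_overflow` of 10 is
SQLAlchemy's `QueuePool` default and is assumed here, and the process counts
are hypothetical; I have not measured how many such processes we actually run.

```python
def worst_case_connections(processes, pool_size, max_overflow):
    """Upper bound on open connections if every process fills its own pool."""
    return processes * (pool_size + max_overflow)


POOL_SIZE = 350     # from our configuration
MAX_OVERFLOW = 10   # assumption: SQLAlchemy's QueuePool default

# Hypothetical process counts, for illustration only.
for processes in (1, 2, 4):
    bound = worst_case_connections(processes, POOL_SIZE, MAX_OVERFLOW)
    print(f"{processes} process(es): up to {bound} connections")
```

   Since the pool is lazy, this is only a ceiling and not what we observe, but
it shows how quickly the ceiling grows with the number of processes.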
   
   Do you have any additional advice on things to try? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

