ephraimbuddy commented on issue #18011: URL: https://github.com/apache/airflow/issues/18011#issuecomment-922546341
> Hi @ephraimbuddy - I work with @WattsInABox. We don't see `FATAL: sorry, too many clients already.` but we do see:
>
> ```
> Traceback (most recent call last):
>   File "/opt/app-root/lib64/python3.8/site-packages/airflow/jobs/base_job.py", line 202, in heartbeat
>     session.merge(self)
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
>     return self._merge(
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
>     merged = self.query(mapper.class_).get(key[1])
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
>     return self._get_impl(ident, loading.load_on_pk_identity)
>
> ....
>
> psycopg2.OperationalError: could not connect to server: Connection timed out
> ```
>
> This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks will now retry since we have #16301, and they eventually succeed, but sometimes a task is SIGTERM'ed 5 times or more before succeeding, which is not ideal for tasks that take an hour plus. I suspect this also sometimes results in downstream tasks being set to `upstream_failed` even though every upstream task actually succeeded, but I can't prove it.
>
> We tried bumping `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to `60` to ease up on hitting the database, with no luck. The error also happens when only a couple of DAGs are running, so there is not much load on our nodes or the database, and we don't think it's a networking issue.
>
> Our SQLAlchemy pool size is 350, which might be high, but my understanding is that the pool does not create connections until they are needed, and according to AWS monitoring the maximum connection count we ever hit at peak is ~300-370, which should be totally manageable on our `db.m6g.4xlarge` instance. However, if it's a 350-connection pool for *each* worker, and each worker opens lots of connections that then stay alive in its pool, perhaps we are exhausting Postgres memory.
>
> Do you have any additional advice on things to try?

In 2.1.4 we added some limits to the number of queued dagruns the scheduler can create, and I suspect the database connection issue will go away with it. I was getting the `FATAL: sorry, too many clients already.` db error until the queued dagruns were limited in https://github.com/apache/airflow/pull/18065.
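For anyone reading along: the per-DAG knob that already exists for capping concurrent runs is `max_active_runs` on the `DAG` object. Whether the 2.1.4 queued-dagrun limit keys off that same value is an assumption on my part, so check the linked PR for the actual mechanism. A minimal sketch of the existing knob (DAG id and schedule are placeholders):

```python
# Minimal sketch: cap concurrent runs per DAG with max_active_runs.
# Whether the new queued-dagrun limit uses this value is an assumption;
# see https://github.com/apache/airflow/pull/18065 for the real change.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_capped_runs",      # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,                 # at most one run of this DAG at a time
) as dag:
    DummyOperator(task_id="noop")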
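```

Separately, on the pool-size reasoning in the quoted report: a rough worst-case estimate, assuming each Airflow component (scheduler, webserver, every worker) keeps its own SQLAlchemy pool. The pool size comes from the report; the component count and overflow value below are assumptions for illustration only.

```python
# Rough worst-case count of Postgres connections across Airflow components.
# pool_size is from the report above; the other numbers are assumptions.
pool_size = 350     # [core] sql_alchemy_pool_size, per component
max_overflow = 10   # [core] sql_alchemy_max_overflow (SQLAlchemy's default)
components = 6      # e.g. scheduler + webserver + 4 workers (hypothetical)

worst_case = components * (pool_size + max_overflow)
print(f"worst-case connections: {worst_case}")  # 6 * 360 = 2160
```

Even though connections are only opened on demand, once opened they stay checked into each component's pool, so the aggregate across many workers can end up far above what a single 350-connection pool suggests.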
