trevorprater opened a new issue #9837: URL: https://github.com/apache/airflow/issues/9837
**Apache Airflow version**: 1.10.10

**Environment**: CentOS Linux 7

- **Cloud provider or hardware configuration**:
- **OS** (e.g. from /etc/os-release): CentOS Linux 7
- **Kernel** (e.g. `uname -a`): cannot disclose
- **Install tools**: n/a
- **Others**: n/a

**What happened**:

I am re-posting [#9735](https://github.com/apache/airflow/issues/9735) (the original did not use the issue template). I have recently seen the same problem, which produced an 800MB log file for a single task run.

```
"ERROR - LocalTaskJob heartbeat got an exception" spammed about > 30,000 times, yielding a massive log file.
According to #5589 and #6284 this issue has been fixed. Both fixes were included 1.10.6, though the problem still exists.
(Background on this error at: http://sqlalche.me/e/e3q8)
```

**What you expected to happen**:

I would expect the DAG to fail in a timely manner once worker heartbeats stop succeeding, rather than logging the same exception indefinitely.

**How to reproduce it**:

This appears to occur randomly, presumably while the database is performing poorly. I suspect it could be reproduced by overloading the DB while a DAG is running.

How often does this problem occur? Rarely: only when the database becomes unreachable.

The logs pasted below are from the linked issue above, not my own. In my case, the underlying database became unavailable for some time; in the logs below, the DB appears to have too many open connections. I am using MySQL, whereas the referenced logs are from Postgres, but the root cause may still be the same.
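To make the expected behaviour concrete, here is a minimal sketch of a heartbeat loop that gives up after a bounded number of consecutive failures instead of logging forever. It is not Airflow's actual code: `run_heartbeat_loop`, `HeartbeatFailed`, and `max_failures` are hypothetical names I made up for illustration, and the real `LocalTaskJob` logic may differ.

```python
import logging
import time

log = logging.getLogger("heartbeat_sketch")


class HeartbeatFailed(RuntimeError):
    """Hypothetical error: too many consecutive heartbeats failed."""


def run_heartbeat_loop(heartbeat, is_done, max_failures=5, interval=0.0,
                       sleep=time.sleep):
    """Call `heartbeat()` repeatedly until `is_done()` returns True.

    Returns the number of successful heartbeats. Raises HeartbeatFailed
    once `max_failures` heartbeats in a row have raised, so a long DB
    outage ends the job instead of filling the log.
    """
    failures = 0
    successes = 0
    while not is_done():
        try:
            heartbeat()
        except Exception as exc:
            failures += 1
            log.error("heartbeat failed (%d/%d): %s",
                      failures, max_failures, exc)
            if failures >= max_failures:
                raise HeartbeatFailed(
                    f"{failures} consecutive heartbeat failures"
                ) from exc
        else:
            # Any success resets the streak, so brief blips are tolerated.
            failures = 0
            successes += 1
        sleep(interval)
    return successes
```

With something like this, an unreachable database would produce at most `max_failures` error entries per task before the job fails, which is the "fail in a timely manner" outcome described above.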
<details>

```
[2020-06-02 03:47:15,676] {logging_mixin.py:112} INFO - [2020-06-02 03:47:15,658] {base_job.py:205} ERROR - LocalTaskJob heartbeat got an exception
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2285, in _wrap_pool_connect
    return fn()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 363, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 773, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 492, in checkout
    rec = pool._do_get()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 238, in _do_get
    return self._create_connection()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 308, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 437, in __init__
    self.__connect(first_connect_check=True)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 657, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
    exc_value, with_traceback=exc_tb,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 652, in __connect
    connection = pool._invoke_creator(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 488, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: ERROR: no more connections allowed (max_client_conn)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/airflow/jobs/base_job.py", line 172, in heartbeat
    session.merge(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2128, in merge
    _resolve_conflict_map=_resolve_conflict_map,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2201, in _merge
    merged = self.query(mapper.class_).get(key[1])
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 1004, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 1119, in _get_impl
    return db_load_fn(self, primary_key_identity)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 284, in load_on_pk_identity
    return q.one()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3358, in one
    ret = self.one_or_none()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3327, in one_or_none
    ret = list(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3403, in __iter__
    return self._execute_and_instances(context)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3425, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3440, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3418, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1133, in connection
    execution_options=execution_options,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1139, in _connection_for_bind
    engine, execution_options
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 432, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2251, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2289, in _wrap_pool_connect
    e, dialect, self
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1555, in _handle_dbapi_exception_noconnection
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2285, in _wrap_pool_connect
    return fn()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 363, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 773, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 492, in checkout
    rec = pool._do_get()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 238, in _do_get
    return self._create_connection()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 308, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 437, in __init__
    self.__connect(first_connect_check=True)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 657, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
    exc_value, with_traceback=exc_tb,
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 652, in __connect
    connection = pool._invoke_creator(self)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 488, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) ERROR: no more connections allowed (max_client_conn)
```

</details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
