[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308343#comment-15308343 ]
Sergei Iakhnin commented on AIRFLOW-191: ---------------------------------------- You are right about celery. I think the root cause is that db connections are not always appropriately disposed of. I.e. I've made a patch where I close the session and dispose of the SQLAlchemy engine every time, and this brings the issue under control somewhat, although this can't be the long term solution. It seems too expensive that job heartbeats should consume as many db connections as there are cores in the cluster. Maybe they could all send their heartbeats to a service that runs on the manager node, and the service can manage a modestly sized connection pool? > Database connection leak on Postgresql backend > ---------------------------------------------- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor > Affects Versions: Airflow 1.7.1 > Reporter: Sergei Iakhnin > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idle localhost > 451 idle in transaction (null) > 446 idle (null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)