[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308343#comment-15308343
 ] 

Sergei Iakhnin commented on AIRFLOW-191:
----------------------------------------

You are right about celery. 

I think the root cause is that db connections are not always appropriately 
disposed of. I.e. I've made a patch where I close the session and dispose of 
the SQLAlchemy engine every time, and this brings the issue under control 
somewhat, although this can't be the long term solution.

It seems too expensive that job heartbeats should consume as many db 
connections as there are cores in the cluster. Maybe they could all send their 
heartbeats to a service that runs on the manager node, and the service can 
manage a modestly sized connection pool?



> Database connection leak on Postgresql backend
> ----------------------------------------------
>
>                 Key: AIRFLOW-191
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-191
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor
>    Affects Versions: Airflow 1.7.1
>            Reporter: Sergei Iakhnin
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1     active  (null)
> 1     idle    localhost
> 451   idle in transaction     (null)
> 446   idle    (null)
> 1     active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to