[jira] [Commented] (AIRFLOW-2271) Undead tasks are heartbeating and not getting killed
[ https://issues.apache.org/jira/browse/AIRFLOW-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422219#comment-16422219 ]

Greg commented on AIRFLOW-2271:
-------------------------------

Thanks [~joygao]. Our implementation of (2) is a little different; it mostly follows the approach from [https://psutil.readthedocs.io/en/latest/#kill-process-tree].
What is the current state of the PR: WIP on code fixes, or waiting for review?

> Undead tasks are heartbeating and not getting killed
> -----------------------------------------------------
>
>                 Key: AIRFLOW-2271
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2271
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.9.0
>            Reporter: Greg
>            Priority: Major
>         Attachments: Airflow_zombies_masked.png
>
>
> Background: we had a resource leak in some of our Airflow operators: after a task completed, the connection pool was not disposed and its processes kept running (see the attached screenshot). This caused the execution pool (size=16) to be exhausted after a couple of days.
> Investigation: we checked those task instances and the related jobs in the database and found a mismatch.
> SQL:
> {code:sql}
> select
>     ti.execution_date,
>     ti.state AS task_state,
>     ti.start_date AS task_start_dt,
>     ti.end_date AS task_end_dt,
>     j.id AS job_id,
>     j.state AS job_state,
>     j.start_date AS job_start_dt,
>     j.end_date AS job_end_dt,
>     j.latest_heartbeat
> from task_instance ti
> join job j
>     on j.id = ti.job_id
> where ti.task_id = 'backup_data_tables'
> order by task_start_dt DESC
> {code}
> ||execution_date||task_state||task_start_dt||task_end_dt||job_id||job_state||job_start_dt||job_end_dt||latest_heartbeat||
> |2018-03-23 23:00:00|success|2018-03-27 08:42:12.846058|2018-03-27 08:42:17.408723|10925|success|2018-03-27 08:42:12.768759|2018-03-27 08:42:22.815474|2018-03-27 08:42:12.768773|
> |2018-03-22 23:00:00|success|2018-03-23 23:02:44.079996|2018-03-24 01:08:52.842612|9683|running|2018-03-23 23:02:44.010813| |2018-03-26 11:29:15.928836|
> |2018-03-21 23:00:00|success|2018-03-22 23:02:14.254779|2018-03-23 01:07:58.322927|9075|running|2018-03-22 23:02:14.199652| |2018-03-26 11:29:16.570076|
> |2018-03-20 23:00:00|success|2018-03-21 23:02:33.417882|2018-03-22 01:16:56.695002|8475|running|2018-03-21 23:02:33.33754| |2018-03-26 11:29:16.529516|
> |2018-03-19 23:00:00|success|2018-03-21 13:20:36.084062|2018-03-21 15:32:51.263954|8412|running|2018-03-21 13:20:36.026206| |2018-03-26 11:29:16.529413|
> As the result set shows, the jobs of completed tasks are still "running" and heartbeating several days after the tasks finished; they stopped only after we killed them manually.
> In the task log files we see many entries like the ones below, which show that the _kill_process_tree()_ method is invoked every ~5 sec:
> {code:java}
> [2018-03-28 13:03:33,013] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:38,211] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:43,290] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:48,416] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:53,604] {helpers.py:269} DEBUG - There are no descendant processes to kill
> {code}
> After some debugging we found that the _LocalTaskJob.terminating_ flag is set to _True_, but the processes are still not getting killed; moreover, the job is still heartbeating.
> Expected result: Airflow is responsible for shutting down its processes and should not leave undead ones behind, even if a force kill is needed.
> Possible fix: we made the following two changes (already applied in our fork):
> - _LocalTaskJob._execute_ - do not heartbeat if the task is terminating
> - _kill_process_tree_ - add a bool argument kill_root and, if True, kill the root process after its descendants
> After that, all the tasks affected by the resource leak shut down correctly, without leaving any "undead" processes.
> Would love to get some feedback from experts about this issue and the fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
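For reference, the psutil recipe Greg links above, extended with the kill_root flag described in the possible fix, might look roughly like the sketch below. The signature, the 5-second grace period, and the SIGKILL fallback are assumptions for illustration, not Airflow's actual kill_process_tree implementation.

{code:python}
import os
import signal

import psutil


def kill_process_tree(pid, kill_root=False, timeout=5):
    """Send SIGTERM to all descendants of `pid` and, if kill_root is True,
    to the root process after its descendants; SIGKILL anything still
    alive after `timeout` seconds.

    Sketch of https://psutil.readthedocs.io/en/latest/#kill-process-tree
    plus the kill_root flag from the issue; parameters are illustrative.
    """
    assert pid != os.getpid(), "refusing to kill the current process"
    root = psutil.Process(pid)
    procs = root.children(recursive=True)
    if kill_root:
        # Appended last, so the root is signalled after its descendants.
        procs.append(root)
    for p in procs:
        try:
            p.send_signal(signal.SIGTERM)
        except psutil.NoSuchProcess:
            pass
    # Grace period, then force-kill whatever survived.
    gone, alive = psutil.wait_procs(procs, timeout=timeout)
    for p in alive:
        try:
            p.kill()  # SIGKILL
        except psutil.NoSuchProcess:
            pass
{code}

The SIGKILL pass after wait_procs covers the "even if a force kill is needed" expectation stated in the issue.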
[jira] [Commented] (AIRFLOW-2271) Undead tasks are heartbeating and not getting killed
[ https://issues.apache.org/jira/browse/AIRFLOW-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420804#comment-16420804 ]

Joy Gao commented on AIRFLOW-2271:
----------------------------------

Please see [PR 2975|https://github.com/apache/incubator-airflow/pull/2975]; it addresses the second change you described above, fixing the zombie threads.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
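The other half of the proposed fix, skipping the heartbeat once _LocalTaskJob.terminating_ is set, boils down to a guard in the job's main loop. The sketch below is illustrative only: the class, the heartrate value, and the stub methods are assumptions, not Airflow's actual LocalTaskJob code.

{code:python}
import time


class LocalTaskJobSketch(object):
    """Illustrative stand-in for a LocalTaskJob-style loop; not Airflow code."""

    heartrate = 5  # seconds between heartbeats (assumed value)

    def __init__(self):
        self.terminating = False

    def _execute(self):
        while self.task_runner_is_alive():
            if not self.terminating:
                # The fix: stop heartbeating once termination starts, so
                # job.latest_heartbeat stops advancing and the job can be
                # declared dead instead of living on as a zombie.
                self.heartbeat()
            time.sleep(self.heartrate)

    # Stubs standing in for the real task runner and heartbeat machinery.
    def task_runner_is_alive(self):
        return False

    def heartbeat(self):
        pass
{code}

Without such a guard, the job keeps refreshing latest_heartbeat even while "terminating", which is exactly the undead behaviour shown in the query results above.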