[jira] [Commented] (AIRFLOW-2271) Undead tasks are heartbeating and not getting killed

2018-04-02 Thread Greg (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422219#comment-16422219
 ] 

Greg commented on AIRFLOW-2271:
---

Thanks [~joygao]

Our implementation of (2) is a little different; it mostly follows the approach 
from [https://psutil.readthedocs.io/en/latest/#kill-process-tree].
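
For reference, that recipe looks roughly like the sketch below (illustrative 
only; as noted, our fork's version differs in the details):

{code:python}
import os
import signal

import psutil


def kill_proc_tree(pid, sig=signal.SIGTERM, include_parent=True,
                   timeout=None, on_terminate=None):
    """Send `sig` to every descendant of `pid`, and optionally to `pid` itself.

    Returns a (gone, still_alive) tuple of psutil.Process objects, so the
    caller can escalate to SIGKILL for anything that ignored the signal.
    """
    assert pid != os.getpid(), "won't kill myself"
    parent = psutil.Process(pid)
    children = parent.children(recursive=True)
    if include_parent:
        children.append(parent)
    for p in children:
        try:
            p.send_signal(sig)
        except psutil.NoSuchProcess:
            pass
    gone, alive = psutil.wait_procs(children, timeout=timeout,
                                    callback=on_terminate)
    return (gone, alive)
{code}

The part that matters for this ticket is include_parent: killing only the 
descendants leaves the root process, and with it the heartbeat, alive.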

What is the current state of the PR? WIP on code fixes or 'waiting for review'?

> Undead tasks are heartbeating and not getting killed
> 
>
> Key: AIRFLOW-2271
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2271
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: worker
>Affects Versions: 1.9.0
>Reporter: Greg
>Priority: Major
> Attachments: Airflow_zombies_masked.png
>
>
> Background: We had a resource leak in some of our Airflow operators: after a 
> task completed, the connection pool was not disposed of and its processes kept 
> running (see attached screenshot). This exhausted the execution pool (size=16) 
> after a couple of days.
>  Investigation:
>  We checked those task instances and the related jobs in the database and 
> found a mismatch:
>  SQL:
> {code:sql}
> SELECT
>     ti.execution_date,
>     ti.state AS task_state,
>     ti.start_date AS task_start_dt,
>     ti.end_date AS task_end_dt,
>     j.id AS job_id,
>     j.state AS job_state,
>     j.start_date AS job_start_dt,
>     j.end_date AS job_end_dt,
>     j.latest_heartbeat
> FROM task_instance ti
> JOIN job j
>     ON j.id = ti.job_id
> WHERE ti.task_id = 'backup_data_tables'
> ORDER BY task_start_dt DESC
> {code}
> ||execution_date||task_state||task_start_dt||task_end_dt||job_id||job_state||job_start_dt||job_end_dt||latest_heartbeat||
> |2018-03-23 23:00:00|success|2018-03-27 08:42:12.846058|2018-03-27 08:42:17.408723|10925|success|2018-03-27 08:42:12.768759|2018-03-27 08:42:22.815474|2018-03-27 08:42:12.768773|
> |2018-03-22 23:00:00|success|2018-03-23 23:02:44.079996|2018-03-24 01:08:52.842612|9683|running|2018-03-23 23:02:44.010813| |2018-03-26 11:29:15.928836|
> |2018-03-21 23:00:00|success|2018-03-22 23:02:14.254779|2018-03-23 01:07:58.322927|9075|running|2018-03-22 23:02:14.199652| |2018-03-26 11:29:16.570076|
> |2018-03-20 23:00:00|success|2018-03-21 23:02:33.417882|2018-03-22 01:16:56.695002|8475|running|2018-03-21 23:02:33.33754| |2018-03-26 11:29:16.529516|
> |2018-03-19 23:00:00|success|2018-03-21 13:20:36.084062|2018-03-21 15:32:51.263954|8412|running|2018-03-21 13:20:36.026206| |2018-03-26 11:29:16.529413|
> As shown in the result set above, the jobs of completed tasks keep running and 
> heartbeating for days after the tasks themselves have finished; they stopped 
> only after we killed them manually.
>  In the task log files we see many entries like the ones below, which show 
> that the _kill_process_tree()_ method is invoked every ~5 seconds:
> {code:java}
> [2018-03-28 13:03:33,013] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:38,211] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:43,290] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:48,416] {helpers.py:269} DEBUG - There are no descendant processes to kill
> [2018-03-28 13:03:53,604] {helpers.py:269} DEBUG - There are no descendant processes to kill
> {code}
> After some debugging we found that the _LocalTaskJob.terminating_ flag is set 
> to _True_, but the processes still do not get killed; moreover, the job keeps 
> heartbeating.
>  Expected result: Airflow is responsible for shutting down its processes and 
> should not leave undead ones behind, even if a force kill is needed.
> Possible fix:
>  We made the following two changes in the code (already applied in our fork); 
> a rough sketch follows the list:
>  - _LocalTaskJob._execute_ - do not heartbeat if the task is terminating
>  - _kill_process_tree_ - add a boolean argument kill_root, and kill the root 
> process after its descendants when it is True
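> For illustration, here is a rough sketch of the two changes; the class and 
> argument names below are simplified stand-ins, not the exact Airflow or fork 
> code:
> {code:python}
> import signal
> import time
>
> import psutil
>
>
> def kill_process_tree(logger, pid, sig=signal.SIGTERM, timeout=5,
>                       kill_root=False):
>     # Change (2): keep the existing "kill the descendants" behaviour, but
>     # optionally signal the root process as well once its children are gone.
>     root = psutil.Process(pid)
>     descendants = root.children(recursive=True)
>     if not descendants:
>         logger.debug("There are no descendant processes to kill")
>     for proc in descendants:
>         try:
>             proc.send_signal(sig)
>         except psutil.NoSuchProcess:
>             pass
>     psutil.wait_procs(descendants, timeout=timeout)
>     if kill_root:
>         try:
>             root.send_signal(sig)
>         except psutil.NoSuchProcess:
>             pass
>         psutil.wait_procs([root], timeout=timeout)
>
>
> class LocalTaskJobSketch:
>     # Change (1), reduced to the heartbeat loop: once `terminating` is set
>     # (in Airflow this happens e.g. from the SIGTERM handler / on_kill path),
>     # stop refreshing latest_heartbeat and tear the task down instead.
>     heartrate = 5
>
>     def __init__(self, task_pid, logger):
>         self.terminating = False
>         self.task_pid = task_pid
>         self.logger = logger
>
>     def heartbeat(self):
>         pass  # in Airflow this refreshes job.latest_heartbeat in the metadata DB
>
>     def _execute(self):
>         while True:
>             if self.terminating:
>                 kill_process_tree(self.logger, self.task_pid, kill_root=True)
>                 return
>             self.heartbeat()
>             time.sleep(self.heartrate)
> {code}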
> After that, all the tasks affected by the resource leak shut down correctly, 
> without leaving any "undead" processes.
> Would love to get some feedback from experts about this issue and the fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2271) Undead tasks are heartbeating and not getting killed

2018-03-30 Thread Joy Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420804#comment-16420804
 ] 

Joy Gao commented on AIRFLOW-2271:
--

Please see [PR 2975|https://github.com/apache/incubator-airflow/pull/2975]; it 
addresses the second change you described above to fix the zombie thread.
