arjunanan6 opened a new issue, #33402:
URL: https://github.com/apache/airflow/issues/33402
### Apache Airflow version
2.6.3
### What happened
This is a new problem that started happening overnight for us, in only one of
our environments. The only change introduced was updating `job_heartbeat_sec`
from its default of `5` to `15`.
Tasks execute normally, but the worker pods are left behind in a `Completed`
state regardless of whether the task succeeded or failed. Here is an example of
the scheduler log for one of the tasks:
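For reference, this is roughly how that change looks in `airflow.cfg` (the file layout below is illustrative; the equivalent Helm-chart env var would be `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC`):

```ini
[scheduler]
# Changed from the default of 5; this was the only configuration change made.
job_heartbeat_sec = 15
```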
```
<TaskInstance: WINRM_TESTER.WINRM_ECHO
manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.110+0000] {scheduler_job_runner.py:476} INFO - DAG
WINRM_TESTER has 0/16 running and queued tasks
<TaskInstance: WINRM_TESTER.WINRM_ECHO
manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.161+0000] {scheduler_job_runner.py:625} INFO - Sending
TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO',
run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
to executor with priority 1 and queue default
[2023-08-15T08:01:34.161+0000] {base_executor.py:147} INFO - Adding to
queue: ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO',
'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir',
'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.219+0000] {kubernetes_executor.py:637} INFO - Add task
TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO',
run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
with command ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO',
'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir',
'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.251+0000] {kubernetes_executor.py:379} INFO - Creating
kubernetes pod for job is TaskInstanceKey(dag_id='WINRM_TESTER',
task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00',
try_number=1, map_index=-1), with pod name winrm-tester-winrm-echo-d3xusoqk
[2023-08-15T08:01:34.306+0000] {scheduler_job_runner.py:677} INFO - Received
executor event with state queued for task instance
TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO',
run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
[2023-08-15T08:01:34.575+0000] {scheduler_job_runner.py:703} INFO - Setting
external_id for <TaskInstance: WINRM_TESTER.WINRM_ECHO
manual__2023-08-15T08:01:31.206777+00:00 [queued]> to 219297
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk
```
The last substantive entry is this "Attempting to adopt pod" line, and the
same pattern appears for every task that has finished but whose worker pod
remains in the `Completed` state.
Following the activity above, the scheduler logs are filled with:
```
[2023-08-15T08:05:24.816+0000] {kubernetes_executor.py:552} INFO - Found 0
queued task instances
[2023-08-15T08:05:25.121+0000] {scheduler_job_runner.py:1553} INFO -
Resetting orphaned tasks for active dag runs
[2023-08-15T08:05:25.265+0000] {scheduler_job_runner.py:1576} INFO - Marked
1 SchedulerJob instances as failed
[2023-08-15T08:05:25.970+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod
it-erpbut-grac-action-usage-sync-h-it-erpbut-grac-action-usage-sync-h-prbfigj4
[2023-08-15T08:05:26.027+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod
sap-wmd-agwi-rsrvd-timeout-h-sap-wmd-agwi-rsrvd-timeout-h-4diptrah
[2023-08-15T08:05:26.085+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod
sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-yusyanm6
[2023-08-15T08:05:26.136+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod tableau-backup-backup-95vpqqsj
[2023-08-15T08:05:26.397+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod
pow-eagle-to-popplan-export-pow-eagle-to-popplan-export-1bz9c733
[2023-08-15T08:05:26.445+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod
pow-eagle-to-sap-isu-export-pow-eagle-to-sap-isu-export-ly4l1tae
[2023-08-15T08:05:26.513+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod
sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-ghqmaqfc
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk
```
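One plausible mechanism (an assumption on my part, not confirmed from Airflow's code): the "Marked 1 SchedulerJob instances as failed" line suggests a job is failing its heartbeat liveness check, and a job whose heartbeat is deemed stale has its tasks reset as orphans and its pods queued for adoption. If the grace window is still sized for the old 5-second interval while the job now heartbeats every 15 seconds, every gap between heartbeats would exceed the window. A minimal sketch of that kind of check, with a hypothetical `is_alive` helper and a 2.1x grace multiplier:

```python
from datetime import datetime, timedelta

# Illustrative only: a job counts as alive while its latest heartbeat falls
# inside a grace window. The window sizes below assume a 2.1x multiplier on
# the heartbeat interval (10.5 s for a 5 s interval, 31.5 s for 15 s).
def is_alive(latest_heartbeat: datetime, now: datetime,
             grace_window_sec: float) -> bool:
    return (now - latest_heartbeat) < timedelta(seconds=grace_window_sec)

now = datetime(2023, 8, 15, 8, 5, 25)
last_beat = now - timedelta(seconds=15)  # job now heartbeats every 15 s

# Against a window still derived from the old 5 s interval, the job looks
# dead between every pair of heartbeats; against a window derived from the
# new 15 s interval, it is fine.
print(is_alive(last_beat, now, grace_window_sec=10.5))  # False
print(is_alive(last_beat, now, grace_window_sec=31.5))  # True
```

If something like this is at play, the fix would be raising the corresponding health-check threshold alongside `job_heartbeat_sec`, but that is speculation on my part.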
Any idea what is going on here? New tasks continue to execute, but we are
left with an ever-growing pile of `Completed` pods.
The strange part is that if I kill the scheduler pod and let it respawn, the
problem is solved, i.e. the leftover pods get deleted. So far this happens in
only one of our environments, even though the configuration is practically
identical in both, each running 2 scheduler instances.
### What you think should happen instead
_No response_
### How to reproduce
N/A
### Operating System
Debian GNU/Linux 11
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Deployed on AKS with official helm chart.
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)