arjunanan6 opened a new issue, #33402:
URL: https://github.com/apache/airflow/issues/33402
### Apache Airflow version
2.6.3
### What happened
This is a new problem that started happening overnight for us, in only one of
our environments. The only change introduced was updating `job_heartbeat_sec`
from its default of `5` to `15`.
Tasks execute normally, but the worker pods are left behind in a `Completed`
state regardless of whether the task succeeded or failed. Here is an example of
the scheduler log for one of the tasks:
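For reference, this is roughly how that change looks in `airflow.cfg` (the file layout below is illustrative; the equivalent Helm-chart env var would be `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC`):

```ini
[scheduler]
# Changed from the default of 5; this was the only configuration change made.
job_heartbeat_sec = 15
```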
```
<TaskInstance: WINRM_TESTER.WINRM_ECHO
manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.110+0000] {scheduler_job_runner.py:476} INFO - DAG
WINRM_TESTER has 0/16 running and queued tasks
<TaskInstance: WINRM_TESTER.WINRM_ECHO
manual__2023-08-15T08:01:31.206777+00:00 [scheduled]>
[2023-08-15T08:01:34.161+0000] {scheduler_job_runner.py:625} INFO - Sending
TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO',
run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
to executor with priority 1 and queue default
[2023-08-15T08:01:34.161+0000] {base_executor.py:147} INFO - Adding to
queue: ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO',
'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir',
'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.219+0000] {kubernetes_executor.py:637} INFO - Add task
TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO',
run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
with command ['airflow', 'tasks', 'run', 'WINRM_TESTER', 'WINRM_ECHO',
'manual__2023-08-15T08:01:31.206777+00:00', '--local', '--subdir',
'DAGS_FOLDER/health_check.py']
[2023-08-15T08:01:34.251+0000] {kubernetes_executor.py:379} INFO - Creating
kubernetes pod for job is TaskInstanceKey(dag_id='WINRM_TESTER',
task_id='WINRM_ECHO', run_id='manual__2023-08-15T08:01:31.206777+00:00',
try_number=1, map_index=-1), with pod name winrm-tester-winrm-echo-d3xusoqk
[2023-08-15T08:01:34.306+0000] {scheduler_job_runner.py:677} INFO - Received
executor event with state queued for task instance
TaskInstanceKey(dag_id='WINRM_TESTER', task_id='WINRM_ECHO',
run_id='manual__2023-08-15T08:01:31.206777+00:00', try_number=1, map_index=-1)
[2023-08-15T08:01:34.575+0000] {scheduler_job_runner.py:703} INFO - Setting
external_id for <TaskInstance: WINRM_TESTER.WINRM_ECHO
manual__2023-08-15T08:01:31.206777+00:00 [queued]> to 219297
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk
```
The last substantive entry is this "Attempting to adopt pod" line, and the
same pattern appears for every task that has finished but whose worker pod
remains in the `Completed` state.
Following the activity above, the scheduler logs are filled with:
```
[2023-08-15T08:05:24.816+0000] {kubernetes_executor.py:552} INFO - Found 0
queued task instances
[2023-08-15T08:05:25.121+0000] {scheduler_job_runner.py:1553} INFO -
Resetting orphaned tasks for active dag runs
[2023-08-15T08:05:25.265+0000] {scheduler_job_runner.py:1576} INFO - Marked
1 SchedulerJob instances as failed
[2023-08-15T08:05:25.970+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod
it-erpbut-grac-action-usage-sync-h-it-erpbut-grac-action-usage-sync-h-prbfigj4
[2023-08-15T08:05:26.027+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod
sap-wmd-agwi-rsrvd-timeout-h-sap-wmd-agwi-rsrvd-timeout-h-4diptrah
[2023-08-15T08:05:26.085+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod
sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-yusyanm6
[2023-08-15T08:05:26.136+0000] {kubernetes_executor.py:912} INFO -
attempting to adopt pod tableau-backup-backup-95vpqqsj
[2023-08-15T08:05:26.397+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod
pow-eagle-to-popplan-export-pow-eagle-to-popplan-export-1bz9c733
[2023-08-15T08:05:26.445+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod
pow-eagle-to-sap-isu-export-pow-eagle-to-sap-isu-export-ly4l1tae
[2023-08-15T08:05:26.513+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod
sap-wmd-send-mail-to-releaser-sap-wmd-send-mail-to-releaser-ghqmaqfc
[2023-08-15T08:05:26.575+0000] {kubernetes_executor.py:951} INFO -
Attempting to adopt pod winrm-tester-winrm-echo-d3xusoqk
```
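One plausible mechanism (an assumption on my part, not confirmed from Airflow's code): the "Marked 1 SchedulerJob instances as failed" line suggests a job is failing its heartbeat liveness check, and a job whose heartbeat is deemed stale has its tasks reset as orphans and its pods queued for adoption. If the grace window is still sized for the old 5-second interval while the job now heartbeats every 15 seconds, every gap between heartbeats would exceed the window. A minimal sketch of that kind of check, with a hypothetical `is_alive` helper and a 2.1x grace multiplier:

```python
from datetime import datetime, timedelta

# Illustrative only: a job counts as alive while its latest heartbeat falls
# inside a grace window. The window sizes below assume a 2.1x multiplier on
# the heartbeat interval (10.5 s for a 5 s interval, 31.5 s for 15 s).
def is_alive(latest_heartbeat: datetime, now: datetime,
             grace_window_sec: float) -> bool:
    return (now - latest_heartbeat) < timedelta(seconds=grace_window_sec)

now = datetime(2023, 8, 15, 8, 5, 25)
last_beat = now - timedelta(seconds=15)  # job now heartbeats every 15 s

# Against a window still derived from the old 5 s interval, the job looks
# dead between every pair of heartbeats; against a window derived from the
# new 15 s interval, it is fine.
print(is_alive(last_beat, now, grace_window_sec=10.5))  # False
print(is_alive(last_beat, now, grace_window_sec=31.5))  # True
```

If something like this is at play, the fix would be raising the corresponding health-check threshold alongside `job_heartbeat_sec`, but that is speculation on my part.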
Any idea what is going on here? New tasks continue to execute, but we are
left with an ever-growing pile of `Completed` pods.
The strange part is that if I kill the scheduler pod and let it respawn, the
problem is solved, i.e. the leftover pods get deleted. So far this happens in
only one of our environments, even though the configuration is practically
identical in both, each running 2 scheduler instances.
### What you think should happen instead
_No response_
### How to reproduce
N/A
### Operating System
Debian GNU/Linux 11
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Deployed on AKS with official helm chart.
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)