GitHub user GustavoBuosi created a discussion: Possible Race Condition on Task 
Scheduling in Kubernetes Executor

Hello! In my company, me, @NiltonDuarte and @leandroszikora are deploying 
Airflow 3.1.7 for many teams and providing support for their infrastructure. In 
the case we are reporting here, our current setup has:

* Airflow 3.1.7
* Python 3.13.11
* `apache-airflow-providers-cncf-kubernetes==10.12.3`
* Linux Images
* 3 API Server Pods
* 3 Scheduler Pods
* 3 DAG Processor Pods

Currently, we have already observed some issues that may affect task runtimes, 
and most of them were indirectly related to infra resources that interfered 
with the Scheduler and API Server behavior.

However, in an environment that only runs on OnDemand instances in Azure 
Kubernetes Services we realized that some tasks would still die abruptly, not 
related to OoM issues. This was observed during one of the attempts of an 
ExternalTaskSensor in Reschedule mode, that was running for approximately 3 
hours and had a timeout of 8 hours.

Two workers were created for the same sensor in a ~1 second time window. 

### Attempt 1 (started at 2026-06-04T00:38:48): 

```json
{"timestamp":"2026-06-04T00:38:48.358475Z","level":"info","event":"Executing 
workload","workload": ...}

{"timestamp":"2026-06-04T00:38:55.696277Z","level":"info","event":"Rescheduling 
task, marking task as 
UP_FOR_RESCHEDULE","logger":"task","filename":"supervisor.py","lineno":1806}
{"timestamp":"2026-06-04T00:38:55.712667Z","level":"error","event":"API server 
error","status_code":409,"detail":{"detail":{"reason":"invalid_state","message":"TI
 was not in the running state so it cannot be 
updated","previous_state":"failed"}}, ...}
```
### Attempt 2 (started at 2026-06-04T00:38:49, one second later):

```json
{"timestamp":"2026-06-04T00:38:49.686111Z","level":"info","event":"Executing 
workload","workload": ...}
{"timestamp":"2026-06-04T00:38:50.500902Z","level":"info","event":"Process 
exited","pid":17,"exit_code":-9,"signal_sent":"SIGKILL","logger":"supervisor","filename":"supervisor.py","lineno":710}
```

We suspect two of out schedulers created a race condition and in the end we get 
a similar behavior as seen in #63183.

We have looked at 2 of our schedulers logs (the third one did not contain 
relevant traces) and interestingly enough found out that:

* Scheduler 1 started attempt 1 of the task. After it got killed, it constantly 
**kept emitting the event that attempt 1 failed continuously** - most likely it 
keeps fetching incorrect state:

```json
2026-06-04T02:34:12.394506Z [warning  ] Event: dag-... Failed, task: 
dag.task.1, annotations: <omitted> 
[airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher]
 loc=kubernetes_executor_utils.py:309
```

* Scheduler 2 started attempt 2 of the task. However, it emitted the event 
about this attempt once and moved on:

```json
2026-06-04T02:34:12.394506Z [warning  ] Event: dag-... Failed, task: 
dag.task.2, annotations: <omitted> 
[airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher]
 loc=kubernetes_executor_utils.py:309
```

We expected that Scheduler 2 would be able to check that there is attempt 1 
that is in UP_FOR_RESCHEDULE and would not generate a second run.

After searching in the repo, the only discussion we've found similar to this is 
#57041 - however, it has not been replied. As its author states, we also have 
checked that we have use_row_level_locking=True.

There does seem to have a PR that would fix the issue for cases when both task 
runs are on success #63355, but it seems our scenario described here is 
different.

We can provide additional logs and info if desired.


GitHub link: https://github.com/apache/airflow/discussions/68071

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to