Baisang opened a new issue, #62970:
URL: https://github.com/apache/airflow/issues/62970
### Apache Airflow version
3.1.7
### If "Other Airflow 3 version" selected, which one?
_No response_
### What happened?
There appears to be a bug in Airflow 3 where `TriggerDagRunOperator` does not
retry when its tasks run via the Kubernetes executor. When the child/triggered
DAG run times out, the parent `TriggerDagRunOperator` task fails without
retrying. This only happens with the Kubernetes executor (the issue is not
observed with the Celery executor).
### What you think should happen instead?
_No response_
### How to reproduce
This is reproducible by launching Airflow in kind (Kubernetes in Docker) with
a pair of DAGs like the following:
```python3
"""
Repro for TriggerDagRunOperator retry bug in Airflow 3.
Child DAG: sleeps 3 minutes but has dagrun_timeout of 1 minute → guaranteed
timeout.
Parent DAG: triggers child with wait_for_completion=True, retries=2.
Expected: parent task fails and retries 2 times.
Observed (bug): parent task fails once and never retries.
To test: trigger `test_trigger_parent` manually and watch the parent task's
`trigger_child_dag` task. It should retry twice after the child times out.
"""
from datetime import datetime, timedelta
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sdk import DAG
# --- Child DAG: guaranteed to time out ---
test_timeout_child = DAG(
    dag_id="test_timeout_child",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=1),
)

sleep_too_long = BashOperator(
    task_id="sleep_too_long",
    bash_command="sleep 180",
    dag=test_timeout_child,
)

# --- Parent DAG: triggers child and should retry on failure ---
test_trigger_parent = DAG(
    dag_id="test_trigger_parent",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=1,
)

TriggerDagRunOperator(
    task_id="trigger_child_dag",
    trigger_dag_id="test_timeout_child",
    wait_for_completion=True,
    poke_interval=10,
    retries=2,
    retry_delay=timedelta(seconds=30),
    dag=test_trigger_parent,
)
```
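For completeness, the deployment used here is the official Helm chart on a kind cluster. A minimal values sketch for reproducing (assumption: `executor` and `airflowVersion` are the only keys changed; everything else was left at chart defaults):

```yaml
# Minimal values.yaml sketch for the official Apache Airflow Helm chart.
# The bug only reproduces with the Kubernetes executor.
executor: KubernetesExecutor
airflowVersion: "3.1.7"
```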
<img width="3454" height="720" alt="Image"
src="https://github.com/user-attachments/assets/214df05c-f248-4da4-81ec-ca43d81d8fef"
/>
This screenshot is from a test launching Airflow 3 in kind with the Kubernetes
executor and running the parent DAG. Notice that the parent task has not
retried at all. The expected behavior (the parent task retrying after the
child times out) is observed when running Airflow locally with the
CeleryExecutor.
We can confirm what is going on in the logs:
```
2026-03-05T07:21:28.241156Z [info ] Received executor event with state
skipped for task instance TaskInstanceKey(dag_id='test_timeout_child',
task_id='sleep_too_long', run_id='manual__2026-03-05T07:20:24.039198+00:00',
try_number=1, map_index=-1)
[airflow.jobs.scheduler_job_runner.SchedulerJobRunner]
loc=scheduler_job_runner.py:822
```
i.e. the child DAG run timed out and its task was marked as skipped. The
parent run is then marked as failed:
```
2026-03-05T07:21:34.665690Z [info ] Marking run <DagRun
test_trigger_parent @ 2026-03-05 07:20:16+00:00:
manual__2026-03-05T07:20:17.339068+00:00, state:running, queued_at: 2026-03-05
07:20:17.346119+00:00. run_type: manual> failed [airflow.models.dagrun.DagRun]
loc=dagrun.py:1171
```
Then we finally get the executor event for the parent task itself:
```
2026-03-05T07:21:36.245789Z [info ] Received executor event with state
failed for task instance TaskInstanceKey(dag_id='test_trigger_parent',
task_id='trigger_child_dag', run_id='manual__2026-03-05T07:20:17.339068+00:00',
try_number=1, map_index=-1)
[airflow.jobs.scheduler_job_runner.SchedulerJobRunner]
loc=scheduler_job_runner.py:822
2026-03-05T07:21:36.248753Z [info ] TaskInstance Finished:
dag_id=test_trigger_parent, task_id=trigger_child_dag,
run_id=manual__2026-03-05T07:20:17.339068+00:00, map_index=-1,
run_start_date=2026-03-05 07:20:23.231694+00:00, run_end_date=2026-03-05
07:21:34.289648+00:00, run_duration=71.057954, state=failed,
executor=KubernetesExecutor(parallelism=32), executor_state=failed,
try_number=1, max_tries=2, pool=default_pool, queue=default, priority_weight=1,
operator=TriggerDagRunOperator, queued_dttm=2026-03-05 07:20:17.775473+00:00,
scheduled_dttm=2026-03-05 07:20:17.767814+00:00,queued_by_job_id=4, pid=18
[airflow.jobs.scheduler_job_runner.SchedulerJobRunner]
loc=scheduler_job_runner.py:868
```
Note `try_number=1, max_tries=2`. But it is too late: the parent run was
already marked as failed, so the task is never retried?
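To make the suspected ordering concrete, here is a toy model of the race. This is not Airflow's actual scheduler code; it only illustrates the hypothesis that the run is finalized before the parent task's failure event is processed, so the retry check never fires:

```python
# Toy model of the suspected event ordering (NOT Airflow's real scheduler code).
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    try_number: int = 1
    max_tries: int = 2  # retries=2 on the operator
    state: str = "running"


@dataclass
class DagRun:
    state: str = "running"
    ti: TaskInstance = field(default_factory=TaskInstance)

    def finalize(self):
        # Scheduler marks the run failed while the task is still reporting back.
        self.state = "failed"

    def handle_executor_event(self, event_state: str):
        # Hypothetical retry check: only re-queue the task if the run is
        # still live when the failure event is processed.
        if event_state == "failed" and self.state == "running":
            if self.ti.try_number <= self.ti.max_tries:
                self.ti.try_number += 1
                self.ti.state = "up_for_retry"
                return
        self.ti.state = "failed"


run = DagRun()
run.finalize()                       # 07:21:34 -- run marked failed first
run.handle_executor_event("failed")  # 07:21:36 -- task event arrives too late
print(run.ti.state, run.ti.try_number)  # no retry despite max_tries=2
```

In this toy ordering the task ends in `failed` with `try_number=1`, matching the logs above; had the event arrived before the run was finalized, the same check would have produced `up_for_retry`.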
### Operating System
Linux
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)