ft2898 opened a new issue, #51640:
URL: https://github.com/apache/airflow/issues/51640
### Apache Airflow version
Other Airflow 2 version (please specify below)
### If "Other Airflow 2 version" selected, which one?
2.10.5
### What happened?
On Apache Airflow 2.10.5 the scheduler crashes with a TypeError when it
detects a zombie task and tries to process it. The scheduler log and stack
trace show that end_date is None when the next_retry_datetime method of
TaskInstance evaluates self.end_date + delay.
Here are the relevant logs and traceback:
> [2025-06-12T01:43:19.317+0800] {scheduler_job_runner.py:2110} ERROR -
Detected zombie job:
{'full_filepath': '/data/airflow/dags/analysis_hourly.py',
'processor_subdir': '/data/airflow/dags',
'msg': "{'DAG Id': 'analysis_hourly', 'Task Id':
'analysis_hourly.tableau-235976', 'Run Id':
'scheduled__2025-06-11T16:00:00+00:00', 'Hostname':
'centos-hadoop3dn-480896.intsig.internal', 'External Executor Id':
'52b249a2-d690-4713-a3b3-1b2e8d72305b'}",
'simple_task_instance': SimpleTaskInstance(dag_id='analysis_hourly',
task_id='analysis_hourly.tableau-235976',
run_id='scheduled__2025-06-11T16:00:00+00:00', map_index=-1,
start_date=datetime.datetime(2025, 6, 11, 17, 34, 33, 41895,
tzinfo=Timezone('UTC')), end_date=None, try_number=1, state='running',
executor=None, executor_config={}, run_as_user=None, pool='default_pool',
priority_weight=2, queue='worker_03',
key=TaskInstanceKey(dag_id='analysis_hourly',
task_id='analysis_hourly.tableau-235976',
run_id='scheduled__2025-06-11T16:00:00+00:00', try_number=1, map_index=-1)),
'task_callback_type': None}
Full traceback:
> File "/data/miniconda/envs/py311/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 2601, in are_dependencies_met
    for dep_status in self.get_failed_dep_statuses(dep_context=dep_context, session=session):
  File "/data/miniconda/envs/py311/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 2625, in get_failed_dep_statuses
    for dep_status in dep.get_dep_statuses(self, session, dep_context):
  File "/data/miniconda/envs/py311/lib/python3.11/site-packages/airflow/ti_deps/deps/base_ti_dep.py", line 115, in get_dep_statuses
    yield from self._get_dep_statuses(ti, session, cxt)
  File "/data/miniconda/envs/py311/lib/python3.11/site-packages/airflow/ti_deps/deps/not_in_retry_period_dep.py", line 48, in _get_dep_statuses
    next_task_retry_date = ti.next_retry_datetime()
                           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda/envs/py311/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 2685, in next_retry_datetime
    return self.end_date + delay
           ~~~~~~~~~~~~~~^~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'datetime.timedelta'
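The log shows end_date=None on the zombie's SimpleTaskInstance while its
state is still 'running', and the traceback shows next_retry_datetime adding
a timedelta to that None value, which crashes the scheduler while the zombie
task is being processed.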
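The failing expression can be reproduced outside Airflow. The sketch below is only an illustration: `FakeTaskInstance` and `reproduce` are hypothetical names, not Airflow classes or functions, and the only line taken from the traceback is the `end_date + delay` addition.

```python
from datetime import datetime, timedelta
from typing import Optional


class FakeTaskInstance:
    """Hypothetical stand-in for a task instance; NOT Airflow's TaskInstance."""

    def __init__(self, end_date: Optional[datetime], retry_delay: timedelta):
        self.end_date = end_date
        self.retry_delay = retry_delay

    def next_retry_datetime(self) -> datetime:
        # Same shape as the last frame of the traceback: end_date + delay.
        return self.end_date + self.retry_delay


def reproduce() -> str:
    """Return the error message raised when end_date is None."""
    # A task that is still 'running' has no end_date yet, as in the log above.
    ti = FakeTaskInstance(end_date=None, retry_delay=timedelta(minutes=5))
    try:
        ti.next_retry_datetime()
    except TypeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce())
```

Running it should print a TypeError message naming 'NoneType' and 'datetime.timedelta', the same operand types reported in the scheduler traceback.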
Steps to Reproduce:
1. Schedule a DAG whose tasks have retries configured and may fail.
2. Observe logs where the scheduler detects zombie tasks (ERROR - Detected
zombie job).
3. Scheduler crashes with the above traceback.
Expected Behavior: The scheduler should handle zombie tasks gracefully
without crashing.
Actual Behavior: The scheduler crashes due to a TypeError in
TaskInstance.next_retry_datetime when end_date is None.
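One way to picture "handling it gracefully" is a None guard around the addition. This is a sketch under assumptions, not Airflow's code or a proposed patch: `safe_next_retry_datetime` is a hypothetical helper, and falling back to the current time when end_date is missing is an assumed policy that may not be the right semantics for zombie tasks.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def safe_next_retry_datetime(
    end_date: Optional[datetime],
    retry_delay: timedelta,
    now: Optional[datetime] = None,
) -> datetime:
    """Hypothetical helper: compute the next retry time without raising on None.

    Assumption: when end_date is missing (e.g. the task never recorded a
    finish time), the current UTC time is used as the base instead.
    """
    if end_date is None:
        base = now if now is not None else datetime.now(timezone.utc)
    else:
        base = end_date
    return base + retry_delay


if __name__ == "__main__":
    fixed_now = datetime(2025, 6, 12, 1, 43, 19, tzinfo=timezone.utc)
    # end_date is None, so the fallback base (fixed_now) is used.
    print(safe_next_retry_datetime(None, timedelta(minutes=5), now=fixed_now))
```

The `now` parameter exists only so the fallback can be tested deterministically; a real fix would need to decide what a missing end_date should mean for the retry period.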
Environment:
1. Airflow version: 2.10.5
2. Python version: 3.11
3. Database backend: MySQL 8.0
4. Executor: Celery-based worker
5. OS: CentOS 7.9
6. DAG configuration: Includes retries and uses default_pool.
Additional Context:
- I suspect that the issue happens when TaskInstance.end_date is None during
processing of zombie tasks.
- This only seems to occur under specific conditions, such as failed or
zombie tasks.
- Link to relevant documentation mentioning zombie tasks
[here](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks).
- This issue did not happen in earlier versions (e.g., 1.10.x).
Let me know if additional logs, configurations, or DAG definitions are
needed to investigate further.
### What you think should happen instead?
_No response_
### How to reproduce
Unfortunately, I have not been able to clearly identify the exact steps to
reproduce the issue. The problem appears sporadically in my environment under
the following conditions:
1. A DAG is scheduled with tasks that include retries in their configuration.
2. A task runs and encounters some form of failure, potentially causing zombie
tasks to appear.
3. The scheduler detects zombie tasks (ERROR - Detected zombie job) and
subsequently crashes due to a TypeError.
### Operating System
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
### Versions of Apache Airflow Providers
_No response_
### Deployment
Virtualenv installation
### Deployment details
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)