GitHub user zacharyloeffler-creator edited a discussion:
DagFileProcessorManager Death Spiral and Defunct Scheduler Processes
### Description
We are experiencing a persistent "death spiral" where the
DagFileProcessorManager fails heartbeats and enters a restart loop. Scheduler
processes become immune to `SIGTERM `and even `SIGKILL`, which eventually leads
to a build-up of defunct processes and scheduler failure. Our environment then
becomes inoperable because DAGs cannot run and our code base will not parse.
The scheduler processes stay in the process table as defunct or in an
uninterruptible sleep state. Our theory is that the interaction between
top-level Python code and the shared filesystem can cause an issue at the OS
level, preventing these scheduler processes from being killed gracefully.
However, we have not been able to successfully reproduce the issue on demand to
identify a singular cause.
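To check the uninterruptible-sleep theory, we look for D-state and zombie processes along with the kernel wait channel they are blocked in (a minimal sketch using standard `ps` format specifiers, not a definitive diagnostic; on a hung NFS mount, `WCHAN` for D-state processes typically points at an NFS or RPC wait function):

```shell
# List processes in uninterruptible sleep (D) or zombie (Z) state.
# stat  = process state flags; wchan = kernel function the process sleeps in.
# The awk filter keeps the header row plus any D/Z-state processes.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^[DZ]/'
```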
### Environment
- Airflow Version: 2.9.3
- Deployment: OpenShift / Kubernetes (Bitnami Chart)
- Storage: NFS Persistent Storage Volumes
### Key Log Signature
The following scheduler log excerpt shows where the DagFileProcessorManager
"death spiral" actually starts:
```
[2026-03-06T10:07:35.383+0000] {manager.py:285} ERROR - DagFileProcessorManager (PID=3027916) last sent a heartbeat 50.90 seconds ago! Restarting it
[2026-03-06T10:07:35.384+0000] {process_utils.py:132} INFO - Sending 15 to group 3027916. PIDs of all processes in the group: [3033769, 3033770, 3027916]
[2026-03-06T10:07:35.385+0000] {process_utils.py:87} INFO - Sending the signal 15 to group 3027916
[2026-03-06T10:08:35.387+0000] {process_utils.py:150} WARNING - process psutil.Process(pid=3033770, name='airflow schedul', status='zombie', started='10:05:54') did not respond to SIGTERM. Trying SIGKILL
[2026-03-06T10:08:35.389+0000] {process_utils.py:150} WARNING - process psutil.Process(pid=3027916, name='airflow scheduler DagFileProcessorManager', status='disk-sleep', started='10:03:14') did not respond to SIGTERM. Trying SIGKILL
[2026-03-06T10:08:35.389+0000] {process_utils.py:150} WARNING - process psutil.Process(pid=3033769, name='airflow schedul', status='zombie', started='10:05:54') did not respond to SIGTERM. Trying SIGKILL
[2026-03-06T10:08:35.390+0000] {process_utils.py:87} INFO - Sending the signal 9 to group 3027916
[2026-03-06T10:09:35.398+0000] {process_utils.py:161} ERROR - Process psutil.Process(pid=3033770, name='airflow schedul', status='zombie', started='10:05:54') (3033770) could not be killed. Giving up.
[2026-03-06T10:09:35.399+0000] {process_utils.py:161} ERROR - Process psutil.Process(pid=3027916, name='airflow scheduler -- DagFileProcessorManager', status='disk-sleep', started='10:03:14') (3027916) could not be killed. Giving up.
[2026-03-06T10:09:35.406+0000] {manager.py:170} INFO - Launched DagFileProcessorManager with pid: 3034098
[2026-03-06T10:10:25.502+0000] {manager.py:285} ERROR - DagFileProcessorManager (PID=3034098) last sent a heartbeat 50.10 seconds ago! Restarting it
```
The process table below shows some of the defunct scheduler processes we are seeing:
<img width="1395" height="780" alt="ps table"
src="https://github.com/user-attachments/assets/1a830acd-f8f6-49fe-a722-6a94355b619b"
/>
### Code Pattern
We believe these DAGs are responsible for the issue, because in both cases
we've seen, it appears after these DAGs are scheduled and executed. The snippet
from our DAG definition file below is lightly redacted, but the logic is
identical to our setup.
The loop generates 4 DAGs from the YAML config file, and we are not importing
any heavy libraries directly into the DAG definition file.
```
os.environ["DLT_PROJECT_DIR"] = full_project_path

with open(config_path, "r") as f:
    config = yaml.safe_load(f)

default_args = {
    "owner": "",
    "start_date": datetime(2026, 2, 13),
    "depends_on_past": False,
}

for pipe_cfg in config["pipelines"]:
    datasource_name = ""
    task_id = pipe_cfg["name"]
    dag_id = datasource_name + f"_{task_id}"

    with DAG(
        dag_id=dag_id,
        default_args=default_args,
        tags=[],
        schedule="@hourly",
        max_active_runs=1,
        catchup=False,
    ) as dag:
        run_dlt = PythonOperator(
            task_id="run_pipeline",
            python_callable=run_pipeline,
            op_kwargs={
                "pipeline_name": pipe_cfg["name"],
                "resource_list": pipe_cfg["resources"],
                "pipeline_data_folder": pipeline_data_folder,
                "cert_path": netscope_cert_path,
            },
            on_failure_callback=google_utils.task_fail_alert,
        )

        trigger_flow = trigger_flow_task.partial(
            cert_path=os.environ.get("OUR_CERT_PATH")
        ).expand_kwargs(pipe_cfg["urls"])

        run_dlt >> trigger_flow

        if pipe_cfg.get("trigger_metadata"):
            metadata_task = trigger_flow_task.partial(
                cert_path=os.environ.get("OUR_CERT_PATH")
            ).expand_kwargs(config["metadata_reporting"]["callback_urls"])
            trigger_flow >> metadata_task

    globals()[dag_id] = dag
```
### Troubleshooting Performed
We isolated the dynamic DAGs into a separate, clean Airflow environment, and
the issue did not immediately recur.
### Questions
- What could we be doing to cause this? Is there any chance it could be caused
by the dlt library? dlt is imported in our main module, which contains
primarily pipeline logic, not in the DAG definition file.
- Are there known issues with dynamic DAG generation and `os.environ` or
top-level `yaml.safe_load` triggering kernel-level I/O blocks on NFS?
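One mitigation we are experimenting with (a minimal sketch, not our production code; `load_config_cached` and the cache layout are made up for illustration) is to stat-and-cache the YAML so that each DAG-parse cycle does not re-read the file over NFS. An `os.stat` can still block on a hung mount, but this shrinks the I/O surface considerably:

```python
import os

import yaml  # PyYAML, already a dependency of our Airflow image

# Module-level cache shared across parse cycles within one processor process.
_CONFIG_CACHE = {"mtime": None, "data": None}


def load_config_cached(config_path):
    """Re-parse the YAML only when the file's mtime changes.

    Most DAG-parse cycles then skip the open+read+parse over NFS and
    perform only a single stat call against the mount.
    """
    mtime = os.stat(config_path).st_mtime
    if _CONFIG_CACHE["mtime"] != mtime:
        with open(config_path, "r") as f:
            _CONFIG_CACHE["data"] = yaml.safe_load(f)
        _CONFIG_CACHE["mtime"] = mtime
    return _CONFIG_CACHE["data"]
```

In the DAG file, `config = yaml.safe_load(f)` at top level would then become `config = load_config_cached(config_path)`.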
GitHub link: https://github.com/apache/airflow/discussions/63749