GitHub user zacharyloeffler-creator edited a discussion: 
DagFileProcessorManager Death Spiral and Defunct Scheduler Processes

### Description
We are experiencing a persistent "death spiral" in which the 
DagFileProcessorManager fails its heartbeat checks and enters a restart loop. 
Scheduler processes stop responding to `SIGTERM` and even `SIGKILL`, which 
eventually leads to a build-up of defunct processes and scheduler failure. Our 
environment then becomes inoperable: DAGs cannot run and our code base will 
not parse.

The scheduler processes remain in the process table as defunct (zombie) or in 
an uninterruptible sleep (`D`) state. Our theory is that the interaction 
between top-level Python code and the shared filesystem causes a block at the 
OS level that prevents these scheduler processes from being killed gracefully. 
However, we have not been able to reproduce the issue on demand to identify a 
single cause.
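For anyone trying to confirm what we are seeing, here is a minimal diagnostic sketch (assuming Linux and a readable `/proc`; the helper name is ours) that lists processes stuck in uninterruptible sleep (`D`) or zombie (`Z`) state, the two states that show up in our logs:

```python
import os

def stuck_processes():
    """Return (pid, state) for processes in 'D' or 'Z' state on Linux."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                # /proc/<pid>/stat is "pid (comm) state ..."; comm may
                # contain spaces, so split after the last ')'.
                fields = f.read().rsplit(")", 1)[1].split()
        except OSError:
            continue  # process exited between listdir() and open()
        state = fields[0]
        if state in ("D", "Z"):
            stuck.append((int(pid), state))
    return stuck
```

Running this inside the scheduler pod during an incident shows the same PIDs the `process_utils` warnings report.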

### Environment
- Airflow Version: 2.9.3 
- Deployment: OpenShift / Kubernetes (Bitnami Chart) 
- Storage: NFS Persistent Storage Volumes 

### Key Log Signature
The following scheduler log excerpt shows where the DagFileProcessorManager 
"death spiral" starts:
```
[2026-03-06T10:07:35.383+0000] {manager.py:285} ERROR - DagFileProcessorManager (PID=3027916) last sent a heartbeat 50.90 seconds ago! Restarting it
[2026-03-06T10:07:35.384+0000] {process_utils.py:132} INFO - Sending 15 to group 3027916. PIDs of all processes in the group: [3033769, 3033770, 3027916]
[2026-03-06T10:07:35.385+0000] {process_utils.py:87} INFO - Sending the signal 15 to group 3027916
[2026-03-06T10:08:35.387+0000] {process_utils.py:150} WARNING - process psutil.Process(pid=3033770, name='airflow schedul', status='zombie', started='10:05:54') did not respond to SIGTERM. Trying SIGKILL
[2026-03-06T10:08:35.389+0000] {process_utils.py:150} WARNING - process psutil.Process(pid=3027916, name='airflow scheduler DagFileProcessorManager', status='disk-sleep', started='10:03:14') did not respond to SIGTERM. Trying SIGKILL
[2026-03-06T10:08:35.389+0000] {process_utils.py:150} WARNING - process psutil.Process(pid=3033769, name='airflow schedul', status='zombie', started='10:05:54') did not respond to SIGTERM. Trying SIGKILL
[2026-03-06T10:08:35.390+0000] {process_utils.py:87} INFO - Sending the signal 9 to group 3027916
[2026-03-06T10:09:35.398+0000] {process_utils.py:161} ERROR - Process psutil.Process(pid=3033770, name='airflow schedul', status='zombie', started='10:05:54') (3033770) could not be killed. Giving up.
[2026-03-06T10:09:35.399+0000] {process_utils.py:161} ERROR - Process psutil.Process(pid=3027916, name='airflow scheduler -- DagFileProcessorManager', status='disk-sleep', started='10:03:14') (3027916) could not be killed. Giving up.
[2026-03-06T10:09:35.406+0000] {manager.py:170} INFO - Launched DagFileProcessorManager with pid: 3034098
[2026-03-06T10:10:25.502+0000] {manager.py:285} ERROR - DagFileProcessorManager (PID=3034098) last sent a heartbeat 50.10 seconds ago! Restarting it
```
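For context on the 50-second threshold in those messages: it appears to match Airflow's `dag_file_processor_timeout` setting, whose default is 50 seconds. Raising it would not fix the underlying hang, but it seems to control how aggressively the agent restarts the manager (please correct us if we have mapped the wrong setting):

```ini
[core]
# Default value; appears to be the source of the
# "last sent a heartbeat 50.xx seconds ago" threshold.
dag_file_processor_timeout = 50
```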
Additionally, here are some of the defunct scheduler processes we are seeing:
<img width="1395" height="780" alt="ps table" 
src="https://github.com/user-attachments/assets/1a830acd-f8f6-49fe-a722-6a94355b619b" />


### Code Pattern
We believe these DAGs are responsible for the issue, because in both 
occurrences it appeared after these DAGs were scheduled and executed. The code 
snippet of our DAG definition file has been redacted slightly, but the logic 
is identical to our setup.

This loop generates 4 DAGs from the YAML config file, and we do not import 
any heavy libraries directly in the DAG definition file.

```python
# Imports shown for completeness; they were redacted from the original snippet
import os
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator

os.environ["DLT_PROJECT_DIR"] = full_project_path

with open(config_path, "r") as f:
    config = yaml.safe_load(f)

default_args = {
    "owner": "",
    "start_date": datetime(2026, 2, 13),
    "depends_on_past": False
}

for pipe_cfg in config['pipelines']:
    datasource_name = ''  # redacted
    task_id = pipe_cfg['name']
    dag_id = datasource_name + f'_{task_id}'

    with DAG(
        dag_id=dag_id,
        default_args=default_args,
        tags=[],
        schedule='@hourly',
        max_active_runs=1,
        catchup=False
    ) as dag:
        run_dlt = PythonOperator(
            task_id="run_pipeline",
            python_callable=run_pipeline,
            op_kwargs={
                "pipeline_name": pipe_cfg['name'],
                "resource_list": pipe_cfg['resources'],
                'pipeline_data_folder': pipeline_data_folder,
                'cert_path': cert_path,
            },
            on_failure_callback=google_utils.task_fail_alert
        )

        trigger_flow = trigger_flow_task.partial(
            cert_path=os.environ.get("OUR_CERT_PATH")
        ).expand_kwargs(pipe_cfg['urls'])

        run_dlt >> trigger_flow

        if pipe_cfg.get('trigger_metadata'):
            metadata_task = trigger_flow_task.partial(
                cert_path=os.environ.get("OUR_CERT_PATH")
            ).expand_kwargs(config['metadata_reporting']['callback_urls'])

            trigger_flow >> metadata_task

    globals()[dag_id] = dag
```
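One mitigation we are considering for the top-level `yaml.safe_load` against NFS: cache the parsed config per process and re-read it only when the file's mtime changes, so each scheduler parse loop does a single `stat()` instead of a full open/read. A sketch with a hypothetical helper name (`json.load` stands in for `yaml.safe_load` to keep it self-contained):

```python
import json
import os

_cache = {"mtime": None, "config": None}

def load_config(path, parse=json.load):
    """Re-parse the config file only when its mtime changes.

    `parse` would be `yaml.safe_load` in the real DAG file.
    """
    mtime = os.stat(path).st_mtime
    if _cache["mtime"] != mtime:
        with open(path, "r") as f:
            _cache["config"] = parse(f)
        _cache["mtime"] = mtime
    return _cache["config"]
```

We have not confirmed this prevents the hang, but it should cut the number of NFS reads issued during every parse cycle.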

### Troubleshooting Performed
We isolated the dynamic DAGs into a separate, clean Airflow environment, and 
the issue did not immediately recur.

### Questions
- What could we be doing to cause this? Is there any chance it could be caused 
by the dlt library? dlt is imported in our main_module, which contains 
primarily pipeline logic, not in the DAG definition file.
- Are there known issues with dynamic DAG generation and `os.environ` or 
top-level `yaml.safe_load` triggering kernel-level I/O blocks on NFS?
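On the first question, one thing we are double-checking ourselves: even though dlt is only imported in main_module, a top-level `from main_module import run_pipeline` in the DAG file would still import dlt transitively at parse time. A sketch of deferring the heavy import into the task callable instead (helper name is ours; `math` stands in for main_module so the example is self-contained):

```python
import importlib

def make_deferred(module_name, func_name):
    """Return a callable that imports `module_name` only when invoked,
    keeping heavy transitive imports (e.g. dlt) out of DAG parsing."""
    def _task(*args, **kwargs):
        func = getattr(importlib.import_module(module_name), func_name)
        return func(*args, **kwargs)
    return _task

# In the DAG file this would be roughly:
#   python_callable=make_deferred("main_module", "run_pipeline")
deferred_sqrt = make_deferred("math", "sqrt")  # stdlib stand-in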

GitHub link: https://github.com/apache/airflow/discussions/63749
